Dataset of sentiment tagged language resources for Bosnian language

Sead Jahić; Jernej Vičič

doi:10.1016/j.dib.2024.110247

Data Brief. 2024 Feb 28:53:110247. doi: 10.1016/j.dib.2024.110247. eCollection 2024 Apr.

Authors

Sead Jahić¹, Jernej Vičič¹ ²

Affiliations

¹ University of Primorska, FAMNIT, Glagoljaska 8, 6000 Koper, Slovenia.
² Research Centre of the Slovenian Academy of Sciences and Arts, The Fran Ramovž Institute, Novi trg 2, 1000, Ljubljana, Slovenia.

Abstract

The Bosnian language is a member of the West-South Slavic subgroup within the Slavic branch of the Indo-European language family. It has approximately 2.5 million speakers in Europe, including 1.87 million in Bosnia and Herzegovina alone, making it the mother tongue of a considerable portion of the population. In Natural Language Processing (NLP) tasks for Bosnian, removing stop words is not enough; the influence of other linguistic elements must also be considered. Bosnian text contains diminishers, relative intensifiers, minimizers, maximizers, boosters, and approximators. These words contribute to the overall meaning of a text and affect its sentiment analysis. By including them in NLP models and algorithms, researchers can achieve a more accurate and nuanced analysis of Bosnian language data and make NLP applications more effective. The dataset, which provides a basis for sentiment analysis in the Bosnian language, was constructed from four resources: two lists of sentiment-annotated words that form the core of the Bosnian sentiment-annotated lexicon, a list of stopwords, and a list of Affirmative and non-Affirmative words (AnAwords) composed mostly of intensifiers and diminishers.

Keywords: AnAwords; Diminishers; Intensifiers; Lexicon; Sense; Stopwords.
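
As an illustration of how the four resources described in the abstract might be combined, the following is a minimal Python sketch of lexicon-based sentiment scoring. It is not the authors' method. The word lists are tiny hypothetical stand-ins and are not taken from the published dataset, the multiplier weights for the AnAwords are assumptions, and the function names `tokenize` and `score` are invented for this example. The rule that an AnAword scales only the next sentiment-bearing word is likewise an assumed simplification.

```python
import re

# Hypothetical toy entries for illustration only; NOT taken from the dataset.
POSITIVE = {"dobar", "odličan"}   # stand-in for the positive word list
NEGATIVE = {"loš", "dosadan"}     # stand-in for the negative word list
STOPWORDS = {"je", "i", "ovaj"}   # stand-in for the stopword list

# Stand-in for the AnAwords list. Each weight is an assumed multiplier:
# above 1.0 for an intensifier, below 1.0 for a diminisher.
ANA_WORDS = {"veoma": 1.5, "malo": 0.5}


def tokenize(text):
    """Lowercase the text and split it into word tokens (Unicode-aware)."""
    return re.findall(r"\w+", text.lower())


def score(text):
    """Sum lexicon polarities, scaling each by any preceding AnAwords."""
    total = 0.0
    modifier = 1.0
    for token in tokenize(text):
        # Check AnAwords first so they are never discarded as stopwords.
        if token in ANA_WORDS:
            modifier *= ANA_WORDS[token]
            continue
        if token in STOPWORDS:
            continue  # stopwords neither score nor reset the modifier
        if token in POSITIVE:
            total += modifier
        elif token in NEGATIVE:
            total -= modifier
        # Any other content word ends the scope of the pending modifier.
        modifier = 1.0
    return total


if __name__ == "__main__":
    print(score("Ovaj film je veoma dobar"))
    print(score("Film je malo dosadan"))
```

In this sketch the AnAwords are looked up before the stopwords, because an intensifier or diminisher that is filtered out as a stopword would lose its effect on the following word.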