Dataset of sentiment tagged language resources for Bosnian language

Data Brief. 2024 Feb 28:53:110247. doi: 10.1016/j.dib.2024.110247. eCollection 2024 Apr.

Abstract

The Bosnian language holds significant importance as a member of the West-South Slavic subgroup within the Slavic branch of the Indo-European linguistic family. With approximately 2.5 million speakers in Europe, including 1.87 million individuals in Bosnia and Herzegovina alone, the Bosnian language constitutes the mother tongue for a considerable portion of the population. In Natural Language Processing (NLP) tasks related to the Bosnian language, besides removing stop words, it is important to consider the influence of other linguistic elements. Bosnian text contains words derived from diminishers, relative intensifiers, minimizers, maximizers, boosters, and approximators. These words contribute to the overall meaning and sentiment analysis of the text. By including these elements in NLP models and algorithms, researchers can achieve more accurate and nuanced analysis of Bosnian language data, enhancing the effectiveness of NLP applications. The two lists of sentiment annotated words that present the core of the Bosnian sentiment-annotated lexicon, a list of the stopwords, and a list of Affirmative and non-Affrimative words (AnAwords) composed mostly of intensifiers and diminishers, were used to construct a dataset that presents the base for sentiment analysis in the Bosnian language.

Keywords: AnAwords; Diminishers; Intensifiers; Lexicon; Sense; Stopwords.