A versatile framework for resource-limited sentiment articulation, annotation, and analysis of short texts

PLoS One. 2020 Nov 12;15(11):e0242050. doi: 10.1371/journal.pone.0242050. eCollection 2020.

Abstract

Choosing a comprehensive and cost-effective way of articulating and annotating the sentiment of a text is not a trivial task, particularly when dealing with short texts, in which sentiment can be expressed through a wide variety of linguistic and rhetorical phenomena. This problem is especially conspicuous in resource-limited settings and languages, where design options are restricted either in terms of manpower and financial means required to produce appropriate sentiment analysis resources, or in terms of available language tools, or both. In this paper, we present a versatile approach to addressing this issue, based on multiple interpretations of sentiment labels that encode information regarding the polarity, subjectivity, and ambiguity of a text, as well as the presence of sarcasm or a mixture of sentiments. We demonstrate its use on Serbian, a resource-limited language, via the creation of a main sentiment analysis dataset focused on movie comments, and two smaller datasets belonging to the movie and book domains. In addition to measuring the quality of the annotation process, we propose a novel metric to validate its cost-effectiveness. Finally, the practicality of our approach is further validated by training, evaluating, and determining the optimal configurations of several different kinds of machine-learning models on a range of sentiment classification tasks using the produced dataset.
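To illustrate the idea of "multiple interpretations of sentiment labels", the following is a minimal sketch of how one fine-grained annotation could be read differently for several classification tasks. The abstract does not list the actual label set, so the class `SentimentLabel`, its field names, and the three mapping functions are hypothetical and should not be read as the authors' scheme.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SentimentLabel:
    """Hypothetical fine-grained label; fields are illustrative assumptions."""
    polarity: Optional[str]   # "positive", "negative", or None if no polarity
    subjective: bool          # whether the text expresses an opinion
    ambiguous: bool = False   # annotator could not settle on a clear reading
    sarcastic: bool = False   # sentiment is expressed through sarcasm
    mixed: bool = False       # text combines positive and negative sentiment


def as_polarity(label: SentimentLabel) -> Optional[str]:
    """Interpretation for a binary polarity task.

    Returns None for texts this (assumed) task would exclude:
    those without polarity, or with ambiguous or mixed sentiment.
    """
    if label.polarity is None or label.ambiguous or label.mixed:
        return None
    return label.polarity


def as_subjectivity(label: SentimentLabel) -> str:
    """Interpretation for a subjective vs. objective task."""
    return "subjective" if label.subjective else "objective"


def as_sarcasm(label: SentimentLabel) -> str:
    """Interpretation for a sarcasm detection task."""
    return "sarcastic" if label.sarcastic else "non-sarcastic"


if __name__ == "__main__":
    # One annotation, three task-specific readings.
    label = SentimentLabel(polarity="negative", subjective=True, sarcastic=True)
    print(as_polarity(label), as_subjectivity(label), as_sarcasm(label))
```

Under these assumptions, a single annotation pass yields training labels for several tasks, which is one way such a scheme could keep annotation costs down in a resource-limited setting.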

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Humans
  • Linguistics / methods*
  • Machine Learning
  • Natural Language Processing
  • Serbia
  • Social Media

Grants and funding

This work was partially funded by the Ministry of Education, Science, and Technological Development of the Republic of Serbia, under project III 44009 (http://www.mpn.gov.rs/). The sentiment annotation process was supported by the Regional Linguistic Data Initiative (ReLDI) project via the Swiss National Science Foundation grant no. 160501 (http://www.snf.ch/). This research was also supported by the Science Fund of the Republic of Serbia, grant no. 6526093, AI – AVANTES (http://fondzanauku.gov.rs/). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.