DravidianCodeMix: sentiment analysis and offensive language identification dataset for Dravidian languages in code-mixed text

Bharathi Raja Chakravarthi; Ruba Priyadharshini; Vigneshwaran Muralidaran; Navya Jose; Shardul Suryawanshi; Elizabeth Sherly; John P McCrae

doi:10.1007/s10579-022-09583-7

Lang Resour Eval. 2022;56(3):765-806. doi: 10.1007/s10579-022-09583-7. Epub 2022 Feb 4.

Authors

Bharathi Raja Chakravarthi¹, Ruba Priyadharshini², Vigneshwaran Muralidaran³, Navya Jose⁴, Shardul Suryawanshi¹, Elizabeth Sherly⁴, John P McCrae¹

Affiliations

¹ Insight SFI Research Centre for Data Analytics, Data Science Institute, National University of Ireland Galway, Galway, Ireland.
² ULTRA Arts and Science College, Madurai, Tamil Nadu, India.
³ School of Computer Science and Informatics, Cardiff University, Cardiff, UK.
⁴ Indian Institute of Information Technology and Management-Kerala, Kazhakkoottam, Kerala, India.

Abstract

This paper describes the development of a multilingual, manually annotated dataset for three under-resourced Dravidian languages, built from social media comments. More than 60,000 YouTube comments were annotated for sentiment analysis and offensive language identification. The dataset consists of around 44,000 comments in Tamil-English, around 7,000 comments in Kannada-English, and around 20,000 comments in Malayalam-English. The data were manually annotated by volunteer annotators and show high inter-annotator agreement as measured by Krippendorff's alpha. The dataset contains all types of code-mixing phenomena, since it comprises user-generated content from a multilingual country. We also present baseline experiments that establish benchmarks on the dataset using machine learning and deep learning methods. The dataset is available on GitHub and Zenodo.
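
For readers unfamiliar with the agreement measure named above, the following is a minimal sketch of Krippendorff's alpha for nominal labels, computed from a coincidence matrix. It is an illustration only, not the authors' evaluation code: the function name `krippendorff_alpha_nominal`, the input layout (one list of labels per comment, with missing ratings simply omitted), and the example labels are all assumptions made for this sketch.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    units: list of lists; each inner list holds the labels that the
    annotators assigned to one comment (hypothetical input layout).
    """
    # Coincidence matrix: every ordered pair of labels given to the same
    # comment by different annotators counts 1 / (m - 1), where m is the
    # number of labels that comment received.
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # a comment with a single rating cannot be paired
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per label and the overall number of pairable values.
    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    # Observed and expected disagreement (unnormalised) for the nominal
    # metric, where any two different labels disagree fully.
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    )
    if expected == 0:
        return float("nan")  # undefined when only one label is ever used

    return 1.0 - (n - 1) * observed / expected


# Illustrative labels only; these are not taken from the dataset.
example = [
    ["positive", "positive"],
    ["positive", "negative"],
    ["negative", "negative"],
    ["negative", "negative"],
]
alpha = krippendorff_alpha_nominal(example)
```

A value of 1 indicates perfect agreement and a value near 0 indicates agreement no better than chance. The sketch handles comments rated by different numbers of annotators, which is common with volunteer annotation.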

Keywords: Code-mixed; Corpora; Dravidian languages; Kannada; Malayalam; Offensive language identification; Sentiment analysis; Tamil.