Corpus creation and language identification for code-mixed Indonesian-Javanese-English Tweets

PeerJ Comput Sci. 2023 Jun 22;9:e1312. doi: 10.7717/peerj-cs.1312. eCollection 2023.

Abstract

With the massive use of social media today, mixing languages in social media text is prevalent. In linguistics, this phenomenon is known as code-mixing. Its prevalence raises various concerns and challenges for natural language processing (NLP), including the task of language identification (LID). This study presents a word-level language identification model for code-mixed Indonesian, Javanese, and English tweets. First, we introduce a code-mixed corpus for Indonesian-Javanese-English language identification (IJELID). To ensure reliable dataset annotation, we provide full details of the data collection procedure and of how the annotation standards were constructed. We also discuss some challenges encountered during corpus creation. Then, we investigate several strategies for developing code-mixed language identification models: fine-tuned BERT models, BLSTM-based models, and conditional random fields (CRF). Our results show that fine-tuned IndoBERTweet models identify languages better than the other techniques. This is attributed to BERT's ability to capture each word's context within the given text sequence. Finally, we show that sub-word language representation in BERT models can provide a reliable model for identifying languages in code-mixed texts.
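
To illustrate what word-level identification over sub-word inputs involves, here is a minimal Python sketch. It is not the authors' implementation. The label names (JV, ID, EN), the example tweet, and the fixed-size `toy_subword_split` function are all assumptions made for this illustration. A real system would use the sub-word tokenizer of the chosen BERT model. The sketch copies each word's language label to its sub-word pieces, then recovers one label per word by taking the prediction of the first piece.

```python
from typing import List, Tuple


def toy_subword_split(word: str, size: int = 4) -> List[str]:
    """Hypothetical stand-in for a real sub-word tokenizer.

    Cuts a non-empty word into fixed-size chunks and marks every
    continuation chunk with a '##' prefix.
    """
    pieces = [word[i:i + size] for i in range(0, len(word), size)]
    return [pieces[0]] + ["##" + p for p in pieces[1:]]


def align_labels(
    words: List[str], labels: List[str]
) -> Tuple[List[str], List[str], List[int]]:
    """Copy each word-level label to all sub-word pieces of that word."""
    subwords: List[str] = []
    sub_labels: List[str] = []
    word_ids: List[int] = []
    for idx, (word, label) in enumerate(zip(words, labels)):
        for piece in toy_subword_split(word):
            subwords.append(piece)
            sub_labels.append(label)
            word_ids.append(idx)  # remember which word the piece came from
    return subwords, sub_labels, word_ids


def merge_to_words(sub_preds: List[str], word_ids: List[int]) -> List[str]:
    """Reduce sub-word predictions to one label per word (first piece wins)."""
    word_preds: List[str] = []
    seen = set()
    for pred, wid in zip(sub_preds, word_ids):
        if wid not in seen:
            seen.add(wid)
            word_preds.append(pred)
    return word_preds


# Made-up code-mixed example (not taken from the IJELID corpus).
words = ["wis", "selesai", "meeting"]
labels = ["JV", "ID", "EN"]  # assumed tag names for Javanese, Indonesian, English

subwords, sub_labels, word_ids = align_labels(words, labels)
recovered = merge_to_words(sub_labels, word_ids)
```

Taking the first piece's prediction is only one possible aggregation choice. Majority voting over a word's pieces is a common alternative.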

Keywords: BERT; Code-mixing; English; Indonesian; Javanese; Language identification; Twitter.

Grants and funding

This work is supported by Universiti Brunei Darussalam (Grant no. UBD/RSCH/1.18/FICBF (a)/2023/007). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.