Dataset of Karakalpak language stop words

Khabibulla Madatov; Shukurla Bekchanov; Jernej Vičič

doi:10.1016/j.dib.2023.109111

Data Brief. 2023 Apr 5:48:109111. doi: 10.1016/j.dib.2023.109111. eCollection 2023 Jun.

Authors

Khabibulla Madatov¹, Shukurla Bekchanov¹, Jernej Vičič²,³

Affiliations

¹ Urgench State University, 14, Kh. Alimdjan str, Urgench City 220100, Uzbekistan.
² University of Primorska, FAMNIT, Glagoljaska 8, Koper 6000, Slovenia.
³ Research Centre of the Slovenian Academy of Sciences and Arts, The Fran Ramovš Institute, Novi trg 2, Ljubljana 1000, Slovenia.

Abstract

The dataset presented in this paper addresses the challenge of automatically extracting stop words in Natural Language Processing (NLP) for Karakalpak, a low-resource language spoken by approximately two million people in Uzbekistan. To this end, we created a corpus of 23 Karakalpak language school textbooks, named the Karakalpak Language School Corpus (KAASC). From the KAASC corpus, we constructed lists of stop words using three methods based on Term Frequency-Inverse Document Frequency (TF-IDF): the unigram, bigram, and collocation methods. The dataset described in this paper consists of the resulting stop word lists together with the list of URLs used to construct the corpus.
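As a rough illustration of the general idea behind TF-IDF-based stop word extraction, the following minimal Python sketch scores unigrams and bigrams over a tiny toy corpus and keeps the lowest-scoring terms as stop word candidates. It is not the authors' procedure: the abstract does not give the exact scoring formula, the thresholds, or the collocation method, so the averaging over documents, the `threshold` parameter, the function names, and the English toy sentences (which stand in for the KAASC textbooks) are all assumptions made for this example.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Join every run of n consecutive tokens into one space-separated term."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf_scores(docs):
    """Return {term: TF-IDF averaged over all documents} for tokenized docs.

    Assumption: tf = count / document length, idf = ln(N / document frequency).
    """
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term once per document
    totals = Counter()
    for doc in docs:
        for term, count in Counter(doc).items():
            tf = count / len(doc)
            idf = math.log(n_docs / df[term])
            totals[term] += tf * idf
    return {term: totals[term] / n_docs for term in df}


def stop_word_candidates(docs, threshold=0.0):
    """Terms whose average TF-IDF is at or below a (hypothetical) threshold."""
    scores = tfidf_scores(docs)
    return sorted(term for term, score in scores.items() if score <= threshold)


# Toy stand-in corpus; the real corpus would be the 23 KAASC textbooks.
unigram_docs = [
    "the cat sat on the mat".split(),
    "the dog ate the bone".split(),
    "the bird sang on the roof".split(),
]
bigram_docs = [ngrams(doc, 2) for doc in unigram_docs]

unigram_stop_words = stop_word_candidates(unigram_docs)
bigram_scores = tfidf_scores(bigram_docs)
```

In this toy corpus only "the" occurs in every document, so its inverse document frequency is zero and it is the sole unigram candidate at the default threshold. The same scoring function is reused for bigrams by treating each bigram as a single term; a real application would need a tokenizer suited to Karakalpak and an empirically chosen threshold.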

Keywords: Bigram; Collocation; Karakalpak language; Machine learning; Stop words; Unigram.