Developing named entity recognition algorithms for Uzbek: Dataset insights and implementation

Davlatyor Mengliev; Vladimir Barakhnin; Nilufar Abdurakhmonova; Mukhriddin Eshkulov

doi:10.1016/j.dib.2024.110413

Data Brief. 2024 Apr 16:54:110413. doi: 10.1016/j.dib.2024.110413. eCollection 2024 Jun.

Authors

Davlatyor Mengliev¹ ², Vladimir Barakhnin¹ ² ³, Nilufar Abdurakhmonova⁴, Mukhriddin Eshkulov⁵

Affiliations

¹ Urgench branch of Tashkent University of Information Technologies named after Muhammad al-Khwarizmi, 110, al-Khwarizmi str., 220100, Urgench city, Uzbekistan.
² Novosibirsk State University, 2, Pirogova str., Novosibirsk city, 630090, Russia.
³ Federal Research Center for Information and Computational Technologies, 6, Academician M.A. Lavrentiev avenue, Novosibirsk, 630090, Russia.
⁴ National University of Uzbekistan named after Mirzo-Ulugbek, 4, Universitet str., Olmazor distr., Tashkent city, 100174, Uzbekistan.
⁵ Jizzakh polytechnic institute, 4, Islom Karimov str., Jizzakh city, 130100, Uzbekistan.

Abstract

This paper presents a dataset and approaches to named entity recognition (NER) in the Uzbek language, a low-resource language setting. Despite the growth in NLP applications, Uzbek remains underrepresented, which underscores the importance of this work. The dataset includes 1,160 sentences with nearly 19,000 word forms annotated for parts of speech and named entities, making it a valuable resource for linguistic research and machine learning applications in Uzbek. For practical application and experiments, we have also developed two algorithms that use this dataset to identify named entities in Uzbek-language texts. We describe the methodology for creating the dataset, the design of the algorithms, and their application to the Uzbek language. This study not only provides an important dataset for future NER tasks in Uzbek, but also offers a methodological basis for applying vocabulary-based NER or machine learning NER to other low-resource languages (e.g. Karakalpak). The dataset and algorithms can be used to build applications such as improved chatbot systems, text mining applications and other analytical tools for the Uzbek language, supporting the development of these areas in the regions where such solutions are deployed.
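To illustrate what vocabulary-based NER generally involves, the following is a minimal sketch of a greedy longest-match dictionary lookup. It is not the authors' algorithm: the `GAZETTEER` entries, the label names (`PER`, `LOC`) and the function `tag_entities` are hypothetical and chosen only for illustration; they are not taken from the published dataset.

```python
from typing import Dict, List, Tuple

# Hypothetical gazetteer: tuple of lowercase word forms -> entity label.
# Entries and labels are illustrative assumptions, not the dataset's contents.
GAZETTEER: Dict[Tuple[str, ...], str] = {
    ("toshkent",): "LOC",
    ("samarqand",): "LOC",
    ("alisher", "navoiy"): "PER",
}

# Longest entry length, so the search never tries spans that cannot match.
MAX_SPAN = max(len(key) for key in GAZETTEER)


def tag_entities(tokens: List[str]) -> List[Tuple[str, str]]:
    """Return (entity text, label) pairs found by greedy longest-match lookup."""
    entities: List[Tuple[str, str]] = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first so multi-word names win.
        for span in range(min(MAX_SPAN, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + span])
            if key in GAZETTEER:
                entities.append((" ".join(tokens[i:i + span]), GAZETTEER[key]))
                i += span
                matched = True
                break
        if not matched:
            i += 1  # no entity starts here; move to the next token
    return entities


if __name__ == "__main__":
    sentence = ["Alisher", "Navoiy", "Toshkent", "haqida", "yozgan"]
    print(tag_entities(sentence))
```

Because Uzbek is agglutinative, an exact-match lookup like this would miss inflected forms (for example, a place name carrying a case suffix), so a practical system would likely need stemming or suffix handling before the lookup.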

Keywords: Language corpus; Linguistic research; Low-resource languages; Named entity; Uzbek language.