A deep learning approach for Named Entity Recognition in Urdu language

Rimsha Anam; Muhammad Waqas Anwar; Muhammad Hasan Jamal; Usama Ijaz Bajwa; Isabel de la Torre Diez; Eduardo Silva Alvarado; Emmanuel Soriano Flores; Imran Ashraf

doi:10.1371/journal.pone.0300725

PLoS One. 2024 Mar 28;19(3):e0300725. doi: 10.1371/journal.pone.0300725. eCollection 2024.

Affiliations

¹ Department of Computer Science, COMSATS University Islamabad, Lahore, Pakistan.
² Department of Computer Science, Government College University, Lahore, Pakistan.
³ Department of Signal Theory, Communications and Telematics Engineering, University of Valladolid, Valladolid, Spain.
⁴ Universidad Europea del Atlántico, Santander, Spain.
⁵ Universidad Internacional Iberoamericana Arecibo, Puerto Rico, Puerto Rico, United States of America.
⁶ Universidade Internacional do Cuanza, Cuito, Bié, Angola.
⁷ Universidad Internacional Iberoamericana Campeche, México.
⁸ Fundación Universitaria Internacional de Colombia Bogotá, Bogotá, Colombia.
⁹ Department of Information and Communication Engineering, Yeungnam University, Gyeongsan, Korea.

Abstract

Named Entity Recognition (NER) is a natural language processing task that has been widely explored for many languages over the past decade, but it remains under-researched for Urdu because of the language's rich morphology and complexity. Existing state-of-the-art studies on Urdu NER use various deep learning approaches with automatic feature selection through word embeddings. This paper presents a deep learning approach for Urdu NER that uses FastText and Floret word embeddings to capture contextual information by considering the words surrounding each word, for improved feature extraction. Publicly available pre-trained FastText and Floret embeddings for Urdu are used to generate feature vectors for four benchmark Urdu datasets. These features are then used as input to train various combinations of Long Short-Term Memory (LSTM), Bidirectional LSTM (BiLSTM), Gated Recurrent Unit (GRU), and Conditional Random Field (CRF) models. The results show that the proposed approach significantly outperforms existing state-of-the-art studies on Urdu NER, achieving an F-score of up to 0.98 with BiLSTM+GRU and Floret embeddings. Error analysis shows a low classification error rate, ranging from 1.24% to 3.63% across the datasets, which indicates the robustness of the approach.
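
As a rough illustration of why subword-based embeddings such as FastText suit a morphologically rich language, the sketch below builds a word vector from hashed character n-grams, so related word forms share part of their representation. It is not the authors' pipeline: the function names, the n-gram range (3 to 5), the bucket count, the MD5-based hash, and the randomly initialised table are all assumptions made for illustration, whereas real pre-trained embeddings use learned vectors and their own hash function.

```python
import hashlib
import random


def char_ngrams(word, min_n=3, max_n=5):
    """Return the character n-grams of a word wrapped in '<' and '>' markers."""
    marked = f"<{word}>"
    return [
        marked[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(marked) - n + 1)
    ]


def bucket_of(ngram, num_buckets):
    """Map an n-gram to a row index (MD5 is a stand-in hash, an assumption)."""
    digest = hashlib.md5(ngram.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def make_table(num_buckets, dim, seed=0):
    """Create a random embedding table; a trained model would supply real values."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(num_buckets)]


def word_vector(word, table):
    """Average the table rows selected by the word's hashed n-grams."""
    # Fall back to the whole marked word if it is too short to yield n-grams.
    grams = char_ngrams(word) or [f"<{word}>"]
    dim = len(table[0])
    total = [0.0] * dim
    for gram in grams:
        row = table[bucket_of(gram, len(table))]
        for k in range(dim):
            total[k] += row[k]
    return [value / len(grams) for value in total]


# Hypothetical sizes chosen only to keep the example small.
table = make_table(num_buckets=1000, dim=8)
vec_country = word_vector("پاکستان", table)    # "Pakistan"
vec_national = word_vector("پاکستانی", table)  # "Pakistani"
shared = set(char_ngrams("پاکستان")) & set(char_ngrams("پاکستانی"))
```

Here `shared` holds the n-grams common to both word forms; because those n-grams select the same table rows, the two vectors overlap even if one of the forms never appeared in training data. In a tagger of the kind the abstract describes, such per-token vectors would be the input features to the recurrent layers.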

Copyright: © 2024 Anam et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

MeSH terms

Benchmarking
Deep Learning*
Language
Names*
Natural Language Processing

Grants and funding

This research was supported by the European University of the Atlantic. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.