Text data augmentation and pre-trained Language Model for enhancing text classification of low-resource languages

PeerJ Comput Sci. 2024 Mar 29:10:e1974. doi: 10.7717/peerj-cs.1974. eCollection 2024.

Abstract

Background: In the domain of natural language processing (NLP), the development and success of advanced language models are predominantly anchored in the richness of available linguistic resources. Languages such as Azerbaijani, which is classified as a low-resource language, often face challenges arising from limited labeled datasets, which hinders effective model training.

Methodology: The primary objective of this study was to enhance the effectiveness and generalization capabilities of news text classification models using text augmentation techniques. To address the scarcity of labeled data in a low-resource language, we augment the texts through translation with the Facebook mBart50 model, with the Google Translate API, and with a combination of the two, thereby enlarging the text available for training.
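
As an illustration only, translation-based augmentation with more than one translator can be sketched as below. The abstract does not specify the translation direction, pivot languages, or any filtering, so everything here is an assumption: the function name `augment_with_translators`, the deduplication rule, and the two stand-in translators (`fake_mbart50`, `fake_google_translate`) are hypothetical placeholders that let the sketch run offline. In practice each callable would wrap a real translation system.

```python
from typing import Callable, Iterable, List, Tuple

# A translator is any callable that maps a text to a translated variant.
Translator = Callable[[str], str]


def augment_with_translators(
    samples: Iterable[Tuple[str, str]],
    translators: List[Translator],
) -> List[Tuple[str, str]]:
    """Return the original (text, label) pairs plus translated variants.

    Each translator contributes at most one variant per sample. Variants that
    are empty or duplicate a text already kept for that sample are dropped
    (an assumed filtering rule, not one stated in the abstract).
    """
    augmented: List[Tuple[str, str]] = []
    for text, label in samples:
        seen = {text}
        augmented.append((text, label))
        for translate in translators:
            variant = translate(text).strip()
            if variant and variant not in seen:
                seen.add(variant)
                # The variant inherits the label of its source text.
                augmented.append((variant, label))
    return augmented


# Hypothetical stand-ins for real translation back ends (assumptions).
def fake_mbart50(text: str) -> str:
    return text.lower()


def fake_google_translate(text: str) -> str:
    return text.replace("news", "report")


data = [("Breaking news today", "politics"), ("sports news", "sport")]
augmented = augment_with_translators(data, [fake_mbart50, fake_google_translate])
```

Passing a single translator corresponds to using one system alone, while passing both corresponds to the combined setting. The classifier would then be trained on `augmented` rather than on `data`.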

Results: The experimental outcomes reveal a promising uptick in classification performance when models are trained on the augmented dataset compared with their counterparts using the original data. This investigation underscores the immense potential of combined data augmentation strategies to bolster the NLP capabilities of underrepresented languages. As a result of our research, we have published our labeled text classification dataset and pre-trained RoBERTa model for the Azerbaijani language.

Keywords: Azerbaijani language; Deep learning; Low-resource language; Machine learning; Natural language processing; Text augmentation; Text classification.

Grants and funding

This work was supported by the Ministry of Education and Sciences of the Republic of Kazakhstan under the following grants: #AP14871214 “Development of machine learning methods to increase the coherence of text in summaries produced by the Extractive Summarization Methods” and #AP09260670 “Development of methods and algorithms for augmenting input data for modifying vector word embeddings”. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.