A normalization model for repeated letters in social media hate speech text based on rules and spelling correction

Zainab Mansur; Nazlia Omar; Sabrina Tiun; Eissa M Alshari

doi:10.1371/journal.pone.0299652

PLoS One. 2024 Mar 21;19(3):e0299652. doi: 10.1371/journal.pone.0299652. eCollection 2024.

Authors

Zainab Mansur¹, Nazlia Omar¹, Sabrina Tiun¹, Eissa M Alshari²

Affiliations

¹ Center for AI Technology (CAIT), FTSM, Universiti Kebangsaan Malaysia, UKM, Bangi, Malaysia.
² Department of Computer Science, Ibb University, Ibb, Yemen.

Abstract

As social media use has grown, abusive online practices such as hate speech have increased as well. Words in social media messages are often written with repeated letters, and the number of such words should be reduced or eliminated to improve the efficacy of hate speech detection. Although several models have attempted to normalize out-of-vocabulary (OOV) words with repeated letters, they often fail to determine whether the in-vocabulary (IV) replacement words are correct. Therefore, this study developed an improved model that normalizes OOV words with repeated letters by replacing them with the correct IV words. The model is unsupervised and requires neither a special dictionary nor annotated data. It combines rule-based patterns for words with repeated letters with the SymSpell spelling correction algorithm. The rules remove repeated letters according to the repetition pattern and the position of the repeated letters in the word (beginning, middle, or end). Two hate speech datasets were used to assess performance. The proposed normalization model decreased the percentage of OOV words to 8%. Its F1 score was also 9% and 13% higher than those of the models proposed in two existing studies. The proposed model therefore performed better than the benchmark studies in replacing OOV words with the correct IV words, and it improved the performance of the detection model. These results indicate that suitable rule-based patterns can be combined with spelling correction to build a text normalization model that correctly replaces words with repeated letters, which in turn improves hate speech detection in text.
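
The general idea of combining repetition rules with a spelling-correction fallback can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a toy vocabulary (VOCAB), a single simplified rule that shortens every run of repeated letters to two letters or one letter regardless of its position in the word, and Python's standard difflib.get_close_matches as a stand-in for SymSpell. The names VOCAB, reduced_forms, and normalize are hypothetical.

```python
import re
from difflib import get_close_matches
from itertools import product

# Toy in-vocabulary (IV) word list (assumption); a real system would load
# a large dictionary instead.
VOCAB = ("all", "cool", "good", "happy", "hate", "see", "so", "stupid", "you")
VOCAB_SET = frozenset(VOCAB)

# A run is any character immediately repeated one or more times ("oo", "ooo", ...).
RUN = re.compile(r"(.)\1+")


def reduced_forms(token):
    """Return all variants in which each run is shortened to two or one letter.

    Variants are ordered from least reduced to most reduced, so the last
    element has every run collapsed to a single letter.
    """
    segments = []  # each segment is a list of alternative spellings
    pos = 0
    for match in RUN.finditer(token):
        if match.start() > pos:
            segments.append([token[pos:match.start()]])
        letter = match.group(1)
        segments.append([letter * 2, letter])
        pos = match.end()
    if pos < len(token):
        segments.append([token[pos:]])
    return ["".join(combo) for combo in product(*segments)]


def normalize(word):
    """Map an OOV word with repeated letters to an IV word when one is found."""
    token = word.lower()
    if token in VOCAB_SET:
        return token
    forms = reduced_forms(token)
    # Rule step: accept the first reduced form that is already in the vocabulary.
    for form in forms:
        if form in VOCAB_SET:
            return form
    # Fallback step: approximate spelling correction on the most reduced form.
    # difflib is only a stand-in for SymSpell here (assumption).
    close = get_close_matches(forms[-1], VOCAB, n=1, cutoff=0.8)
    return close[0] if close else word
```

With this toy vocabulary, normalize("goooood") returns "good" and normalize("sooooo") returns "so" through the rule step alone, while a word with no close vocabulary match is returned unchanged. The position-specific rules described in the abstract are not modeled in this sketch.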

Copyright: © 2024 Mansur et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

MeSH terms

Hate
Humans
Language
Social Media*
Speech*
Vocabulary

Grants and funding

This work was supported by the grant FRGS/1/2020/ICT02/UKM/02/6 and TAP-K007009 from Universiti Kebangsaan Malaysia and the Ministry of Higher Education (MOHE). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.