An Automated Toxicity Classification on Social Media Using LSTM and Word Embedding

Ahmad Alsharef; Karan Aggarwal; Sonia; Deepika Koundal; Hashem Alyami; Darine Ameyed

doi:10.1155/2022/8467349

Comput Intell Neurosci. 2022 Feb 15;2022:8467349. doi: 10.1155/2022/8467349. eCollection 2022.

Authors

Ahmad Alsharef¹, Karan Aggarwal², Sonia¹, Deepika Koundal³, Hashem Alyami⁴, Darine Ameyed⁵

Affiliations

¹ Yogananda School of Artificial Intelligence, Computing and Data Science, Shoolini University, Solan, Himachal Pradesh 173229, India.
² Electronics and Communication Engineering Department, Maharishi Markandeshwar (Deemed to be University), Mullana, Ambala 133207, India.
³ Department of Systemics, School of Computer Science, University of Petroleum & Energy Studies, Dehradun, India.
⁴ Department of Computer Science, College of Computers and Information Technology, Taif University, P.O. Box 11099, Taif 21944, Saudi Arabia.
⁵ System Engineering Department, Ecole de Technologie Supérieure, University of Quebec, Montreal, Canada.

Abstract

The automated identification of toxicity in text is a crucial area of text analysis, since social media is replete with unfiltered content that ranges from mildly abusive to downright hateful. Researchers have found that training datasets introduce unintended bias and unfairness, which leads to inaccurate classification of toxic words in context. In this paper, several approaches for locating toxicity in text are assessed and presented, with the aim of enhancing the overall quality of text classification. General unsupervised methods were used, relying on state-of-the-art models and external embeddings, to improve accuracy while relieving bias and enhancing the F1-score. The suggested approaches combine a long short-term memory (LSTM) deep learning model with GloVe word embeddings, and an LSTM with word embeddings generated by Bidirectional Encoder Representations from Transformers (BERT). These models were trained and tested on a large secondary qualitative dataset containing a large number of comments labeled as toxic or not. An acceptable accuracy of 94% and an F1-score of 0.89 were achieved using LSTM with BERT word embeddings in the binary classification of comments (toxic and nontoxic). The combination of LSTM and BERT performed better than both LSTM alone and LSTM with GloVe word embeddings. This paper tries to solve the problem of classifying comments with high accuracy by pretraining models on larger corpora of text (high-quality word embeddings) rather than relying on the training data alone.
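
As a rough illustration of the general idea in the abstract (an LSTM reading fixed, externally supplied word embeddings and emitting a toxic/nontoxic probability), the following minimal sketch may help. It is not the authors' implementation: the tiny embedding table, the random untrained weights, the dimensions, and the function names (`lstm_step`, `init_params`, `toxicity_probability`) are all assumptions made for illustration, and a real system would load GloVe or BERT vectors and train the weights on labeled comments.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def lstm_step(x, h, c, params):
    """One LSTM step: x is the word embedding, h and c the previous states."""
    z = x + h  # list concatenation: [x; h]
    gates = {}
    for name in ("i", "f", "o", "g"):
        W, b = params[name]
        pre = [p + bi for p, bi in zip(matvec(W, z), b)]
        act = math.tanh if name == "g" else sigmoid
        gates[name] = [act(p) for p in pre]
    # Cell state: forget part of the old state, add gated candidate values.
    c_new = [f * cp + i * g
             for f, cp, i, g in zip(gates["f"], c, gates["i"], gates["g"])]
    h_new = [o * math.tanh(cn) for o, cn in zip(gates["o"], c_new)]
    return h_new, c_new

def init_params(embed_dim, hidden_dim, rng):
    # Random, untrained weights (assumption: stands in for a trained model).
    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    params = {n: (mat(hidden_dim, embed_dim + hidden_dim), [0.0] * hidden_dim)
              for n in "ifog"}
    params["out"] = ([rng.uniform(-0.5, 0.5) for _ in range(hidden_dim)], 0.0)
    return params

def toxicity_probability(tokens, embeddings, params, hidden_dim):
    """Run the LSTM over the tokens and map the last hidden state to (0, 1)."""
    h = [0.0] * hidden_dim
    c = [0.0] * hidden_dim
    unk = [0.0] * len(next(iter(embeddings.values())))  # unknown-word vector
    for tok in tokens:
        # Embeddings are looked up, never updated: they come from outside.
        h, c = lstm_step(embeddings.get(tok, unk), h, c, params)
    w, b = params["out"]
    return sigmoid(sum(wi * hi for wi, hi in zip(w, h)) + b)

# Hypothetical 3-dimensional "pretrained" vectors, for illustration only.
EMBEDDINGS = {
    "you": [0.1, -0.2, 0.3],
    "are": [0.0, 0.1, -0.1],
    "great": [0.4, 0.5, -0.3],
}
HIDDEN_DIM = 4
PARAMS = init_params(embed_dim=3, hidden_dim=HIDDEN_DIM, rng=random.Random(0))

p = toxicity_probability(["you", "are", "great"], EMBEDDINGS, PARAMS, HIDDEN_DIM)
print(f"P(toxic) = {p:.3f}")  # untrained weights, so the value is not meaningful
```

The point of the sketch is the division of labor: the embedding table is fixed and supplied from a larger corpus, while only the LSTM and output weights would be learned from the labeled comments.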

Publication types

Retracted Publication

MeSH terms

Data Accuracy
Data Collection
Humans
Machine Learning
Natural Language Processing
Social Media*