A performance comparison of supervised machine learning models for Covid-19 tweets sentiment analysis

Furqan Rustam; Madiha Khalid; Waqar Aslam; Vaibhav Rupapara; Arif Mehmood; Gyu Sang Choi

doi:10.1371/journal.pone.0245909

PLoS One. 2021 Feb 25;16(2):e0245909. doi: 10.1371/journal.pone.0245909. eCollection 2021.

Authors

Furqan Rustam¹, Madiha Khalid¹, Waqar Aslam², Vaibhav Rupapara³, Arif Mehmood², Gyu Sang Choi⁴

Affiliations

¹ Department of Computer Science, Khwaja Fareed University of Engineering and Information Technology, Rahim Yar Khan, Pakistan.
² Department of Computer Science & Information Technology, The Islamia University of Bahawalpur, Bahawalpur, Punjab, Pakistan.
³ School of Computing and Information Sciences, Florida International University, Miami, FL, United States of America.
⁴ Department of Information & Communication Engineering, Yeungnam University, Gyeongsan, Gyeongbuk, Korea.

Abstract

The spread of Covid-19 has raised health concerns worldwide, and social media is increasingly used to share news and opinions about it. A realistic assessment of the situation is necessary to use resources optimally and appropriately. In this research, we perform sentiment analysis of Covid-19 tweets using a supervised machine learning approach. Identifying the sentiments expressed in Covid-19 tweets would allow informed decisions for better handling of the current pandemic. The dataset is extracted from Twitter using IDs provided by IEEE DataPort. Tweets are extracted by an in-house crawler built on the Tweepy library. The dataset is cleaned using preprocessing techniques, and sentiments are extracted using the TextBlob library. The contribution of this work is a performance evaluation of various machine learning classifiers using our proposed feature set, which is formed by concatenating bag-of-words and term frequency-inverse document frequency (TF-IDF) features. Tweets are classified as positive, neutral, or negative. Classifier performance is evaluated in terms of accuracy, precision, recall, and F1 score. For completeness, the dataset is further investigated using the Long Short-Term Memory (LSTM) deep learning architecture. The results show that the Extra Trees Classifier outperforms all other models, achieving an accuracy score of 0.93 with our proposed concatenated feature set. The LSTM achieves lower accuracy than the machine learning classifiers. To demonstrate the effectiveness of our proposed feature set, the results are compared with the Vader sentiment analysis technique based on the GloVe feature extraction approach.
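As an illustration of the concatenated feature set described above, the following minimal Python sketch builds a bag-of-words vector and a TF-IDF vector for each tweet and joins them side by side. It is not the authors' implementation: the whitespace tokenizer, the smoothed IDF formula ln((1 + n) / (1 + df)) + 1, the absence of normalization, the function names, and the three example tweets are all assumptions made for this sketch.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for real preprocessing)."""
    return text.lower().split()


def build_vocabulary(docs):
    """Map each distinct token to a column index, in sorted order."""
    tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(tokens)}


def bow_vector(doc, vocab):
    """Raw term counts for one document, one entry per vocabulary token."""
    counts = Counter(tokenize(doc))
    return [float(counts.get(tok, 0)) for tok in vocab]


def idf_weights(docs, vocab):
    """Smoothed inverse document frequency (an assumed variant)."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))  # count each token once per document
    return [math.log((1 + n) / (1 + df[tok])) + 1.0 for tok in vocab]


def concatenated_features(docs):
    """Return the vocabulary and one [BoW | TF-IDF] row per document."""
    vocab = build_vocabulary(docs)
    idf = idf_weights(docs, vocab)
    rows = []
    for doc in docs:
        bow = bow_vector(doc, vocab)
        tfidf = [count * weight for count, weight in zip(bow, idf)]
        rows.append(bow + tfidf)  # list concatenation: twice the vocabulary width
    return vocab, rows


# Hypothetical example tweets, not taken from the paper's dataset.
tweets = [
    "stay safe stay home",
    "vaccine news is good",
    "lockdown is bad news",
]
vocab, features = concatenated_features(tweets)
```

Each row has twice as many columns as the vocabulary has tokens: the first half holds raw counts and the second half holds the same counts weighted by IDF. In practice the same idea is usually expressed with scikit-learn's `CountVectorizer` and `TfidfVectorizer`, stacking the two sparse matrices horizontally before training a classifier.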

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

COVID-19*
Deep Learning
Humans
Natural Language Processing
Pandemics
Public Opinion
Social Media*
Supervised Machine Learning*

Grants and funding

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education (NRF-2019R1A2C1006159) and awarded to GSC, and by the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2020-2016-0-00313) supervised by the IITP (Institute for Information & communications Technology Promotion), also awarded to GSC.