Classification of movie reviews using term frequency-inverse document frequency and optimized machine learning algorithms

Muhammad Zaid Naeem; Furqan Rustam; Arif Mehmood; Mui-Zzud-Din; Imran Ashraf; Gyu Sang Choi

doi:10.7717/peerj-cs.914

PeerJ Comput Sci. 2022 Mar 15:8:e914. doi: 10.7717/peerj-cs.914. eCollection 2022.

Authors

Muhammad Zaid Naeem^#¹, Furqan Rustam^#¹, Arif Mehmood², Mui-Zzud-Din¹, Imran Ashraf³, Gyu Sang Choi³

Affiliations

¹ Department of Computer Science, Khwaja Fareed University of Engineering and Information Technology, Rahim Yar Khan, Pakistan.
² Department of Computer Science & Information Technology, The Islamia University of Bahawalpur, Bahawalpur, Pakistan.
³ Information and Communication Engineering, Yeungnam University, Gyeongsan si, Daegu, South Korea.

^# Contributed equally.

Abstract

The Internet Movie Database (IMDb), a popular online database for movies and personalities, hosts a wide range of movie reviews from millions of users. These reviews form a large and diverse dataset for analyzing users' sentiments about movies and personalities. Although the reviews offer useful critiques of movies, they are too numerous to be read in full, so automated tools are required to extract insights about the sentiments they express. This study implements various machine learning models to measure the polarity of the sentiments expressed in user reviews on the IMDb website. For this purpose, the reviews are first preprocessed to remove redundant information and noise, and then classification models such as support vector machines (SVM), Naïve Bayes, random forest, and gradient boosting classifiers are used to predict the sentiment of these reviews. The objective is to find the process and approach that attain the highest accuracy with the best generalization. Feature engineering approaches such as term frequency-inverse document frequency (TF-IDF), bag of words, global vectors for word representation (GloVe), and Word2Vec are applied together with hyperparameter tuning of the classification models to improve classification accuracy. Experimental results indicate that SVM obtains the highest accuracy, 89.55%, when used with TF-IDF features. The classification accuracy of the models is reduced by contradictions between the sentiments users express in the reviews and the labels assigned to them. To tackle this issue, TextBlob is used to assign a sentiment label to each review in the dataset before it is used for training. Experimental results on the TextBlob-assigned sentiments indicate that the proposed model can reach an accuracy of 92%.
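To make the TF-IDF feature step concrete, the following is a minimal sketch in plain Python. It is an illustration only and not the authors' implementation: the tokenizer, the function names `tokenize` and `tfidf`, and the two toy reviews are assumptions introduced here, and the weighting uses the textbook form tf(t, d) × ln(N / df(t)), whereas libraries commonly apply smoothed variants.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens.

    A stand-in (assumption) for the preprocessing described in the abstract.
    """
    return re.findall(r"[a-z']+", text.lower())


def tfidf(docs):
    """Return one {term: weight} dict per document.

    Weight = (term count / document length) * ln(N / document frequency).
    This is the unsmoothed textbook formula, chosen here for clarity.
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return vectors


# Hypothetical toy reviews, invented for illustration only.
reviews = ["great movie great cast", "boring movie"]
vectors = tfidf(reviews)
```

In this toy corpus the word "movie" occurs in every review, so its weight is zero, while words specific to one review ("great", "boring") receive positive weights. Vectors of this kind are what a classifier such as an SVM would take as input.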

Keywords: Bag of words; Movies reviews; Sentiment classification; Supervised machine learning; Text analysis.

Grants and funding

This work was supported in part by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2019R1A2C1006159 and NRF-2021R1A6A1A03039493), and in part by the 2021 Yeungnam University Research Grant. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.