Topic2features: a novel framework to classify noisy and sparse textual data using LDA topic distributions

PeerJ Comput Sci. 2021 Aug 11;7:e677. doi: 10.7717/peerj-cs.677. eCollection 2021.

Abstract

In supervised machine learning, and in classification tasks specifically, selecting and analyzing the feature vector is one of the most important steps toward better results. Traditional methods, such as comparing the cosine similarity of features or manually exploring the dataset to check which feature vector is suitable, are relatively time consuming. Many classification tasks fail to achieve good results because of poor feature vector selection and data sparseness. In this paper, we propose a novel framework, topic2features (T2F), for short and sparse data: it takes the topic distributions of hidden topics extracted from the dataset and converts them into feature vectors for building a supervised classifier. To this end, we leverage the unsupervised topic modelling approach LDA (latent Dirichlet allocation) to obtain the topic distributions that are then used in supervised learning algorithms. We make use of labelled data and the topic distributions of hidden topics generated from that data. We explore how the topic-based representation affects classification performance by applying supervised classification algorithms. Additionally, we carefully evaluate the framework on two types of datasets and compare it with baseline approaches without topic distributions and with other comparable methods. The results show that our framework performs significantly better in terms of classification performance than the baseline approaches (without T2F) and also improves the F1 score relative to the other compared approaches.
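
The core idea, using each document's LDA topic distribution as its feature vector for a supervised classifier, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the small collapsed Gibbs sampler stands in for whichever LDA implementation the authors used, the nearest-centroid classifier is a placeholder for their supervised algorithms, the function names (lda_topic_features, fit_centroids, predict) are hypothetical, and the toy texts, labels and hyperparameters are made up. For brevity the sketch fits LDA on all texts, including the held-out one, since LDA itself uses no labels.

```python
import random
from collections import Counter


def lda_topic_features(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; return one topic distribution per doc."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]         # topic counts per document
    topic_word = [Counter() for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics                       # total tokens per topic
    assign = []                                          # topic of every token

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample its topic from the collapsed conditional.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # Smoothed, normalized topic counts are the document's feature vector.
    return [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in counts]
        for counts, doc in zip(doc_topic, docs)
    ]


def fit_centroids(features, labels):
    """Average the feature vectors of each class (placeholder classifier)."""
    grouped = {}
    for vec, label in zip(features, labels):
        grouped.setdefault(label, []).append(vec)
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in grouped.items()
    }


def predict(centroids, vec):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    return min(
        centroids,
        key=lambda label: sum((a - b) ** 2 for a, b in zip(centroids[label], vec)),
    )


# Toy short texts (made up); the last one is held out and has no label.
texts = [
    "match goal team win",
    "team player goal score",
    "phone app update battery",
    "app phone screen update",
    "goal win player",
]
labels = ["sports", "sports", "tech", "tech"]

docs = [t.split() for t in texts]
theta = lda_topic_features(docs, num_topics=2)  # topic distributions as features
centroids = fit_centroids(theta[:4], labels)    # train on the labelled documents
pred = predict(centroids, theta[4])             # classify the held-out document
print(pred)
```

Each row of theta is a probability vector over the hidden topics, so a short text with only a few words still gets a dense, fixed-length representation. In practice one would more likely swap in an established LDA implementation and a stronger classifier; the pipeline shape (topic inference, then supervised training on the topic vectors) stays the same.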

Keywords: Classification; Machine learning; Natural language processing; Social media; Sparse Data; Text analysis; Topic analysis.

Grants and funding

This work was supported by the National Key Technologies R&D Program (grant numbers 2020YFB1712401 and 2018YFB1701401), the Nature Science Foundation of China (grant number 62006210), the major project of Zhengzhou Collaborative Innovation (grant numbers 20XTZX-009 and 20XTZX-X010), the National Key R&D Program of China 2018 and the Key Scientific and Technological Research Projects in the Henan Province of China under grant number 192102310216, the National Key R&D Program of China (2018******02), and the 2020 Major Project Public Benefit Project in Henan Province (201300210500). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.