Comparing writing style feature-based classification methods for estimating user reputations in social media

Jong Hwan Suh

doi:10.1186/s40064-016-1841-1

Springerplus. 2016 Mar 2;5:261. doi: 10.1186/s40064-016-1841-1. eCollection 2016.

Author

Jong Hwan Suh¹

Affiliation

¹ Moon Soul Graduate School of Future Strategy, KAIST, 291 Daehak-ro, Yuseong-gu, Daejeon, 34141 Republic of Korea.

Abstract

In recent years, the anonymous nature of the Internet has made it difficult to detect manipulated user reputations in social media, as well as to ensure the quality of users and their posts. To address this, this study designs and examines an automatic approach that uses writing style features to estimate user reputations in social media. Under varying ways of defining Good and Bad classes of user reputations based on the collected data, it evaluates the classification performance of state-of-the-art methods: four types of writing style features (lexical, syntactic, structural, and content-specific) and eight classification techniques, namely four base learners (C4.5, Neural Network (NN), Support Vector Machine (SVM), and Naïve Bayes (NB)) and four Random Subspace (RS) ensemble methods built on those base learners. With South Korea's Web forum Daum Agora as the test bed, the experimental results show that the full feature set, which includes content-specific features, combined with RS-SVM (RS applied to SVM) gives the best classification accuracy when poster reputations are segmented strictly into Good and Bad classes by the portfolio approach. Pairwise t tests on accuracy confirm two expectations drawn from the literature review: first, the feature set that adds content-specific features outperforms the others; second, ensemble learning methods are more viable than base learners. Moreover, among the four ways of defining the classes of user reputations (like, dislike, sum, and portfolio), the portfolio approach gives the highest accuracy.

Keywords: Classification techniques; Comparative studies; Ensemble learning; Social media; User reputation estimation; Writing style features.
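
As a rough illustration of the Random Subspace (RS) idea named in the abstract, the following minimal sketch trains each ensemble member on a random subset of the features and combines the members by majority vote. It is not the study's implementation: the base learner here is a simple nearest-centroid classifier standing in for C4.5, NN, SVM, or NB, and the function names, parameters, and the four-feature toy vectors are all hypothetical values chosen only to make the example runnable.

```python
import random
from collections import Counter


def train_centroid(X, y):
    """Stand-in base learner (assumption): store the mean vector of each class."""
    sums, counts = {}, Counter(y)
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for i, value in enumerate(row):
            acc[i] += value
    return {c: [s / counts[c] for s in acc] for c, acc in sums.items()}


def predict_centroid(model, row):
    """Return the class whose centroid is nearest in squared Euclidean distance."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(row, model[c]))
    return min(sorted(model), key=dist)


def train_random_subspace(X, y, n_estimators=11, subspace_size=2, seed=0):
    """Train one base learner per randomly drawn feature subset."""
    rng = random.Random(seed)
    n_features = len(X[0])
    ensemble = []
    for _ in range(n_estimators):
        idx = sorted(rng.sample(range(n_features), subspace_size))
        X_sub = [[row[i] for i in idx] for row in X]
        ensemble.append((idx, train_centroid(X_sub, y)))
    return ensemble


def predict_random_subspace(ensemble, row):
    """Majority vote over the members, each seeing only its own feature subset."""
    votes = Counter(
        predict_centroid(model, [row[i] for i in idx]) for idx, model in ensemble
    )
    return votes.most_common(1)[0][0]


# Toy writing-style feature vectors (made-up numbers, not data from the study).
X = [
    [0.9, 0.8, 0.7, 0.9],
    [0.8, 0.9, 0.8, 0.7],
    [0.1, 0.2, 0.1, 0.3],
    [0.2, 0.1, 0.3, 0.2],
]
y = ["Good", "Good", "Bad", "Bad"]

ensemble = train_random_subspace(X, y, n_estimators=11, subspace_size=2, seed=0)
label = predict_random_subspace(ensemble, [0.85, 0.75, 0.9, 0.8])
```

An odd number of members avoids tied votes in the two-class (Good versus Bad) setting. In practice the base learner would be replaced by one of the four learners the study compares, and the subspace size would be tuned rather than fixed.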