Practical detection of spammers and content promoters in online video sharing systems

IEEE Trans Syst Man Cybern B Cybern. 2012 Jun;42(3):688-701. doi: 10.1109/TSMCB.2011.2173799. Epub 2011 Nov 30.

Abstract

A number of online video sharing systems, out of which YouTube is the most popular, provide features that allow users to post a video as a response to a discussion topic. These features open opportunities for users to introduce polluted content, or simply pollution, into the system. For instance, spammers may post an unrelated video as response to a popular one, aiming at increasing the likelihood of the response being viewed by a larger number of users. Moreover, content promoters may try to gain visibility to a specific video by posting a large number of (potentially unrelated) responses to boost the rank of the responded video, making it appear in the top lists maintained by the system. Content pollution may jeopardize the trust of users on the system, thus compromising its success in promoting social interactions. In spite of that, the available literature is very limited in providing a deep understanding of this problem. In this paper, we address the issue of detecting video spammers and promoters. Towards that end, we first manually build a test collection of real YouTube users, classifying them as spammers, promoters, and legitimate users. Using our test collection, we provide a characterization of content, individual, and social attributes that help distinguish each user class. We then investigate the feasibility of using supervised classification algorithms to automatically detect spammers and promoters, and assess their effectiveness in our test collection. While our classification approach succeeds at separating spammers and promoters from legitimate users, the high cost of manually labeling vast amounts of examples compromises its full potential in realistic scenarios. For this reason, we further propose an active learning approach that automatically chooses a set of examples to label, which is likely to provide the highest amount of information, drastically reducing the amount of required training data while maintaining comparable classification effectiveness.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Artificial Intelligence*
  • Computer Simulation
  • Decision Support Techniques*
  • Information Storage and Retrieval / methods*
  • Internet*
  • Models, Theoretical*
  • Online Systems
  • Pattern Recognition, Automated / methods*
  • Video Recording / methods*