An iterative topic model filtering framework for short and noisy user-generated data: analyzing conspiracy theories on twitter

Gillian Kant; Levin Wiebelt; Christoph Weisser; Krisztina Kis-Katos; Mattias Luber; Benjamin Säfken

doi:10.1007/s41060-022-00321-4

Int J Data Sci Anal. 2022 May 6:1-21. doi: 10.1007/s41060-022-00321-4. Online ahead of print.

Authors

Gillian Kant¹, Levin Wiebelt¹, Christoph Weisser², Krisztina Kis-Katos², Mattias Luber¹, Benjamin Säfken²

Affiliations

¹ University of Göttingen, Göttingen, Germany.
² Campus-Institut Data Science (CIDAS), University of Göttingen, Göttingen, Germany.

Abstract

Conspiracy theories have seen a rise in popularity in recent years. Because they spread quickly through social media, their disruptive effect can lead to a biased public view on policy decisions and events. We present a novel approach to LDA pre-processing, called Iterative Filtering, to study such phenomena based on Twitter data. In combination with Hashtag Pooling as an additional pre-processing step, we are able to achieve a coherent framing of the discussion and topics of interest, despite the inherent noisiness and sparseness of Twitter data. Our approach enables researchers to gain detailed insights into discourses of interest on Twitter by iteratively identifying tweets that are related to an investigated topic. As an application, we study the dynamics of conspiracy-related topics on US Twitter during the last four months of 2020, which were dominated by the US Presidential Elections and Covid-19. We monitor the public discourse in the USA with geo-spatial Twitter data to identify conspiracy-related content by estimating Latent Dirichlet Allocation (LDA) topic models. We find that in this period, the usual conspiracy-related topics played a marginal role in comparison with dominating topics, such as the US Presidential Elections or the general discussions about Covid-19. The main conspiracy theories in this period were the ones linked to "Election Fraud" and the "Covid-19-hoax." Conspiracy-related keywords tended to appear together with Trump-related words and words related to his presidential campaign.
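To make the Hashtag Pooling step more concrete, the following is a minimal sketch of the general idea only: tweets that share a hashtag are concatenated into one longer pseudo-document before topic modeling, which reduces the sparseness of individual tweets. It is not the authors' implementation. The function name pool_by_hashtag, the regular expression, the handling of tweets without hashtags, and the example tweets are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Assumption: a hashtag is '#' followed by word characters.
HASHTAG_RE = re.compile(r"#(\w+)")


def pool_by_hashtag(tweets):
    """Group tweets into pooled documents keyed by lowercased hashtag.

    A tweet with several hashtags is added to every matching pool.
    Tweets without any hashtag are returned separately (hypothetical
    choice; other treatments are possible).
    """
    pools = defaultdict(list)
    unpooled = []
    for text in tweets:
        tags = {tag.lower() for tag in HASHTAG_RE.findall(text)}
        if not tags:
            unpooled.append(text)
        for tag in tags:
            pools[tag].append(text)
    # Concatenate each pool into a single pseudo-document.
    pooled_docs = {tag: " ".join(texts) for tag, texts in pools.items()}
    return pooled_docs, unpooled


# Invented example tweets, for illustration only.
tweets = [
    "Counting is still going on #Election2020",
    "Mail-in ballots explained #election2020 #vote",
    "Stay home, stay safe #covid19",
    "Just had coffee",
]
pooled_docs, unpooled = pool_by_hashtag(tweets)
```

The pooled documents could then be tokenized and passed to any LDA implementation in place of the raw tweets.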

Keywords: Conspiracy theories; Covid-19; Geo-spatial analysis; Hashtag pooling; Iterative filtering; LDA; Latent Dirichlet allocation; NLP; NLP pre-processing; SARS-CoV-2; Sentiment analysis.