Dataset for multimodal fake news detection and verification tasks

Alessandro Bondielli; Pietro Dell'Oglio; Alessandro Lenci; Francesco Marcelloni; Lucia Passaro

doi:10.1016/j.dib.2024.110440

Data Brief. 2024 Apr 16:54:110440. doi: 10.1016/j.dib.2024.110440. eCollection 2024 Jun.

Authors

Alessandro Bondielli¹, Pietro Dell'Oglio², Alessandro Lenci³, Francesco Marcelloni², Lucia Passaro¹

Affiliations

¹ Department of Computer Science, University of Pisa, Largo Bruno Pontecorvo, 3, 56127, Pisa, Italy.
² Department of Information Engineering, University of Pisa, Largo Lucio Lazzarino, 1, 56122, Pisa, Italy.
³ Department of Philology, Literature and Linguistics, University of Pisa, Via S. Maria 36, 56127, Pisa, Italy.

Abstract

The proliferation of online disinformation and fake news, particularly in the context of breaking news events, demands the development of effective detection mechanisms. While text remains the predominant medium for disseminating misleading information, other modalities play an increasingly important role on online outlets and social media platforms. However, multimodal datasets, which combine modalities such as text and images, are still uncommon, especially for low-resource languages. This study addresses this gap by releasing a dataset tailored for multimodal fake news detection in Italian. The dataset was originally employed in a shared task on the Italian language. It is divided into two data subsets, each corresponding to a distinct sub-task. Sub-task 1 aims to assess the effectiveness of multimodal fake news detection systems. Sub-task 2 aims to investigate the interplay between text and images, specifically how these modalities mutually influence the interpretation of content when distinguishing between fake and real news. Both sub-tasks were framed as classification problems. The dataset consists of social media posts and news articles. After collection, the data were labeled via crowdsourcing. Annotators were provided with external knowledge about the topic of each news item to be labeled, enhancing their ability to discriminate between fake and real news. The data subsets for sub-task 1 and sub-task 2 consist of 913 and 1350 items, respectively, encompassing newspaper articles and tweets.

Keywords: Data collection and annotation; Fake news; Machine learning; Multimodal data; Natural language processing.

Publication types

News