Dataset for multimodal fake news detection and verification tasks

Data Brief. 2024 Apr 16;54:110440. doi: 10.1016/j.dib.2024.110440. eCollection 2024 Jun.

Abstract

The proliferation of online disinformation and fake news, particularly around breaking news events, demands effective detection mechanisms. While text remains the predominant medium for disseminating misleading information, other modalities play a growing role in online outlets and on social media platforms. However, multimodal datasets, which combine modalities such as text and images, are still uncommon, especially for low-resource languages. This study addresses this gap by releasing a dataset tailored for multimodal fake news detection in Italian. The dataset was originally used in a shared task on the Italian language. It is divided into two subsets, each corresponding to a distinct sub-task. Sub-task 1 assesses the effectiveness of multimodal fake news detection systems. Sub-task 2 examines the interplay between text and images, specifically how the two modalities influence each other in the interpretation of content when distinguishing fake from real news. Both sub-tasks were framed as classification problems. The dataset consists of social media posts (tweets) and newspaper articles, which were labeled via crowdsourcing after collection. Annotators were provided with external knowledge about the topic of each news item to be labeled, enhancing their ability to discriminate between fake and real news. The subsets for sub-task 1 and sub-task 2 contain 913 and 1350 items, respectively.

Keywords: Data collection and annotation; Fake news; Machine learning; Multimodal data; Natural language processing.

Publication types

  • News