Bias and comparison framework for abusive language datasets

AI Ethics. 2022;2(1):79-101. doi: 10.1007/s43681-021-00081-0. Epub 2021 Jul 19.

Abstract

Numerous datasets have been produced in recent years as research activity in the automatic detection of abusive language and hate speech has increased. A problem with this diversity is that the datasets often differ in, among other things, context, platform, sampling process, collection strategy, and labeling schema. Existing surveys compare these datasets only superficially. We therefore developed a bias and comparison framework for abusive language datasets that enables their in-depth analysis, and we apply it to compare five English and six Arabic datasets. We make this framework available so that researchers and data scientists who work with such datasets are aware of their properties and can account for them in their work.

Keywords: Abusive language detection; Arabic; Bias; English; Hate speech detection.

Publication types

  • Review