A noise audit of human-labeled benchmarks for machine commonsense reasoning

Sci Rep. 2024 Apr 14;14(1):8609. doi: 10.1038/s41598-024-58937-4.

Abstract

With the advent of large language models, evaluating and benchmarking these systems on important AI problems has taken on newfound importance. Such benchmarking typically involves comparing the predictions of a system against human labels (or a single 'ground truth'). However, much recent work in psychology has suggested that most tasks involving significant human judgment can have non-trivial degrees of noise. In his book Noise, Kahneman suggests that noise may be a much more significant component of inaccuracy than bias, which has been studied more extensively in the AI community. This article proposes a detailed noise audit of human-labeled benchmarks in machine commonsense reasoning, an important current area of AI research. We conduct noise audits under two experimental conditions: a smaller-scale but higher-quality labeling setting, and a larger-scale, more realistic online crowdsourced setting. Using Kahneman's framework of noise, our results consistently show non-trivial amounts of level, pattern, and system noise, even in the higher-quality setting, with comparable results in the crowdsourced setting. We find that noise can significantly influence the performance estimates we obtain for commonsense reasoning systems, even when the 'system' is a human, in some cases by almost 10 percent. Labeling noise also affects performance estimates of systems like ChatGPT by more than 4 percent. Our results suggest that the default practice in the AI community of assuming and using a single 'ground truth', even on problems requiring seemingly straightforward human judgment, may warrant empirical and methodological revisiting.
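For readers unfamiliar with the level/pattern/system noise terminology, the sketch below illustrates the variance decomposition popularized in Kahneman, Sibony, and Sunstein's Noise as it might be applied to a judges-by-items label matrix. It is a minimal, assumption-laden illustration of the framework, not the paper's actual audit procedure; the function name, toy data, and rating scale are hypothetical.

```python
# Illustrative sketch of Kahneman-style noise decomposition on a
# judges-by-items rating matrix (hypothetical example, not the
# authors' exact methodology).
import numpy as np

def noise_decomposition(ratings: np.ndarray) -> dict:
    """ratings: shape (n_judges, n_items), numeric labels or scores."""
    grand_mean = ratings.mean()
    judge_means = ratings.mean(axis=1)   # each judge's average level
    item_means = ratings.mean(axis=0)    # each item's average rating

    # Level noise: variability in judges' overall average judgments.
    level_var = np.var(judge_means)

    # Pattern noise: judge-item interaction remaining after removing
    # judge-level and item-level effects.
    residuals = ratings - judge_means[:, None] - item_means[None, :] + grand_mean
    pattern_var = np.var(residuals)

    # System noise combines both components (variances add).
    system_var = level_var + pattern_var
    return {
        "level_noise": float(np.sqrt(level_var)),
        "pattern_noise": float(np.sqrt(pattern_var)),
        "system_noise": float(np.sqrt(system_var)),
    }

# Toy example: 3 annotators labeling 4 commonsense items on a 1-5 scale.
toy = np.array([
    [4, 5, 3, 4],
    [3, 4, 2, 3],
    [5, 5, 4, 4],
])
print(noise_decomposition(toy))
```

In this toy run, the level-noise term captures that the second annotator rates everything lower on average, while the pattern-noise term captures item-specific disagreements beyond those overall level differences.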

MeSH terms

  • Benchmarking*
  • Books
  • Humans
  • Judgment
  • Language
  • Problem Solving*