Mapping global dynamics of benchmark creation and saturation in artificial intelligence

Simon Ott; Adriano Barbosa-Silva; Kathrin Blagec; Jan Brauner; Matthias Samwald

doi:10.1038/s41467-022-34591-0

Nat Commun. 2022 Nov 10;13(1):6793. doi: 10.1038/s41467-022-34591-0.

Authors

Simon Ott^#¹, Adriano Barbosa-Silva^#¹ ², Kathrin Blagec¹, Jan Brauner³ ⁴, Matthias Samwald⁵

Affiliations

¹ Institute of Artificial Intelligence, Medical University of Vienna. Währingerstraße 25a, 1090, Vienna, Austria.
² ITTM S.A.-Information Technology for Translational Medicine, Esch-sur-Alzette, 4354, Luxembourg.
³ Oxford Applied and Theoretical Machine Learning (OATML) Group, Department of Computer Science, University of Oxford, Oxford, UK.
⁴ Future of Humanity Institute, University of Oxford, Oxford, UK.
⁵ Institute of Artificial Intelligence, Medical University of Vienna. Währingerstraße 25a, 1090, Vienna, Austria. matthias.samwald@meduniwien.ac.at.

^# Contributed equally.

Abstract

Benchmarks are crucial to measuring and steering progress in artificial intelligence (AI). However, recent studies raised concerns over the state of AI benchmarking, reporting issues such as benchmark overfitting, benchmark saturation and increasing centralization of benchmark dataset creation. To facilitate monitoring of the health of the AI benchmarking ecosystem, we introduce methodologies for creating condensed maps of the global dynamics of benchmark creation and saturation. We curate data for 3765 benchmarks covering the entire domains of computer vision and natural language processing, and show that a large fraction of benchmarks quickly trends towards near-saturation, that many benchmarks fail to find widespread utilization, and that benchmark performance gains for different AI tasks are prone to unforeseen bursts. We analyze attributes associated with benchmark popularity, and conclude that future benchmarks should emphasize versatility, breadth and real-world utility.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Artificial Intelligence*
Benchmarking* / methods
Ecosystem
Physical Phenomena