Rethink reporting of evaluation results in AI

Science. 2023 Apr 14;380(6641):136-138. doi: 10.1126/science.adf6369. Epub 2023 Apr 13.

Ryan Burnell¹, Wout Schellaert², John Burden¹,³, Tomer D Ullman⁴, Fernando Martinez-Plumed², Joshua B Tenenbaum⁵, Danaja Rutar¹, Lucy G Cheke¹,⁶, Jascha Sohl-Dickstein⁷, Melanie Mitchell⁸, Douwe Kiela⁹, Murray Shanahan¹⁰,¹¹, Ellen M Voorhees¹², Anthony G Cohn¹³,¹⁴,¹⁵,¹⁶, Joel Z Leibo¹⁰, Jose Hernandez-Orallo¹,²,³

¹ Leverhulme Centre for the Future of Intelligence, University of Cambridge, Cambridge, UK.
² Valencian Research Institute for Artificial Intelligence, Universitat Politècnica de Valencia, València, Spain.
³ Centre for the Study of Existential Risk, University of Cambridge, Cambridge, UK.
⁴ Department of Psychology, Harvard University, Cambridge, MA, USA.
⁵ Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA, USA.
⁶ Department of Psychology, University of Cambridge, Cambridge, UK.
⁷ Brain team, Google, Mountain View, CA, USA.
⁸ Santa Fe Institute, Santa Fe, NM, USA.
⁹ Stanford University, Stanford, CA, USA.
¹⁰ DeepMind, London, UK.
¹¹ Department of Computing, Imperial College London, London, UK.
¹² National Institute of Standards and Technology (Retired), Gaithersburg, MD, USA.
¹³ School of Computing, University of Leeds, Leeds, UK.
¹⁴ Alan Turing Institute, London, UK.
¹⁵ Tongji University, Shanghai, China.
¹⁶ Shandong University, Jinan, China.

Abstract

Aggregate metrics and lack of access to results limit understanding.