Can large language models pass official high-grade exams of the European Society of Neuroradiology courses? A direct comparison between OpenAI chatGPT 3.5, OpenAI GPT4 and Google Bard

Gennaro D'Anna; Sofie Van Cauter; Majda Thurnher; Johan Van Goethem; Sven Haller

doi:10.1007/s00234-024-03371-6

Neuroradiology. 2024 May 6. doi: 10.1007/s00234-024-03371-6. Online ahead of print.

Authors

Gennaro D'Anna¹, Sofie Van Cauter²,³, Majda Thurnher⁴, Johan Van Goethem⁵,⁶, Sven Haller⁷,⁸,⁹,¹⁰

Affiliations

¹ Neuroimaging Unit, ASST Ovest Milanese, Legnano, Milan, Italy. gennaro.danna@gmail.com.
² Department of Medical Imaging, Ziekenhuis Oost-Limburg, Genk, Belgium.
³ Department of Medicine and Life Sciences, Hasselt University, Hasselt, Belgium.
⁴ Department for Biomedical Imaging and Image-Guided Therapy, Medical University of Vienna, Vienna, Austria.
⁵ Department of Medical and Molecular Imaging, VITAZ, Sint-Niklaas, Belgium.
⁶ Department of Radiology, University Hospital Antwerp, Antwerp, Belgium.
⁷ CIMC-Centre d'Imagerie Médicale de Cornavin, Geneva, Switzerland.
⁸ Department of Surgical Sciences, Radiology, Uppsala University, Uppsala, Sweden.
⁹ Faculty of Medicine, University of Geneva, Geneva, Switzerland.
¹⁰ Department of Radiology, Beijing Tiantan Hospital, Capital Medical University, Beijing, People's Republic of China.

PMID: 38705899
DOI: 10.1007/s00234-024-03371-6

Abstract

We compared three large language models (LLMs), chatGPT 3.5, GPT4, and Google Bard, on examinations from four courses of the European Society of Neuroradiology (ESNR), and tested whether their performance differs across subspecialty domains. Written ESNR exams served as input data: anatomy/embryology (30 questions), neuro-oncology (50 questions), head and neck (50 questions), and pediatrics (50 questions). All exams together, and each exam separately, were submitted to the three LLMs. Statistical analyses included a group-wise Friedman test followed by pair-wise Wilcoxon tests with correction for multiple comparisons. Overall, there was a significant difference between the three LLMs (p < 0.0001), with GPT4 achieving the highest accuracy (70%), followed by chatGPT 3.5 (54%) and Google Bard (36%). Pair-wise comparisons showed significant differences for chatGPT 3.5 vs GPT4 (p < 0.0001), chatGPT 3.5 vs Google Bard (p < 0.0023), and GPT4 vs Google Bard (p < 0.0001). Per-subspecialty analyses showed the largest difference between the best LLM (GPT4, 70%) and the worst LLM (Google Bard, 24%) in the head and neck exam, while the difference was least pronounced in neuro-oncology (GPT4, 62% vs Google Bard, 48%). We observed significant differences in the performance of the three LLMs on official exams organized by the ESNR. Overall, GPT4 performed best and Google Bard performed worst. This difference varied by subspecialty and was most pronounced in head and neck.
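
The statistical procedure named in the abstract (a group-wise Friedman test followed by pair-wise Wilcoxon tests with multiple-comparison correction) might be sketched as follows. This is a minimal illustration, not the authors' analysis code: the function name `compare_models` is hypothetical, the Bonferroni correction is an assumption (the abstract does not say which correction was used), and the demo data are randomly generated placeholders, not the study's per-question results.

```python
from itertools import combinations

import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon


def compare_models(scores):
    """Group-wise Friedman test, then pair-wise Wilcoxon signed-rank tests.

    scores: dict mapping a model name to a 1-D array of per-question
    correctness (1 = correct, 0 = wrong), all in the same question order.
    Returns (group p-value, dict of corrected pair-wise p-values).
    """
    names = list(scores)
    arrays = [np.asarray(scores[name], dtype=float) for name in names]

    # Friedman test: do the paired samples differ across the models?
    p_group = friedmanchisquare(*arrays).pvalue

    # Pair-wise Wilcoxon signed-rank tests on the paired differences.
    pairs = list(combinations(range(len(names)), 2))
    pairwise = {}
    for i, j in pairs:
        p = wilcoxon(arrays[i], arrays[j]).pvalue
        # ASSUMPTION: Bonferroni correction (multiply by number of pairs).
        pairwise[(names[i], names[j])] = min(1.0, p * len(pairs))
    return p_group, pairwise


if __name__ == "__main__":
    # Synthetic placeholder data only: 180 simulated questions per model.
    rng = np.random.default_rng(0)
    demo = {
        "model_a": rng.random(180) < 0.7,
        "model_b": rng.random(180) < 0.5,
        "model_c": rng.random(180) < 0.3,
    }
    p_group, pairwise = compare_models(demo)
    print("Friedman p-value:", p_group)
    for pair, p in pairwise.items():
        print(pair, "corrected p-value:", p)
```

With binary correctness scores, many paired differences are zero; SciPy's default `wilcoxon` settings drop those zero differences before ranking, so the pair-wise tests effectively compare only the questions on which two models disagree.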

Keywords: AI; GPT4; LLM; Neuroradiology; chatGPT.