Can AI pass the written European Board Examination in Neurological Surgery? - Ethical and practical issues

Brain Spine. 2024 Feb 13:4:102765. doi: 10.1016/j.bas.2024.102765. eCollection 2024.

Abstract

Introduction: Artificial intelligence (AI)-based large language models (LLMs) hold enormous potential in education and training. Recent publications have demonstrated that they are able to outperform participants in written medical exams.

Research question: We aimed to explore the accuracy of AI in the written part of the European Association of Neurosurgical Societies (EANS) board exam.

Material and methods: Eighty-six representative single best answer (SBA) questions, each included at least ten times in prior EANS board exams, were selected by the current EANS board exam committee. The questions were classified by content as 75 text-based (TB) and 11 image-based (IB), and by structure as 50 interpretation-weighted, 30 theory-based and 6 true-or-false. Questions were tested with ChatGPT 3.5, Bing and Bard. The AI and participant results were statistically analyzed through ANOVA tests with Stata SE 15 (StataCorp, College Station, TX). P-values < 0.05 were considered statistically significant.
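The study performed its ANOVA in Stata SE 15; as an illustration only, the following is a minimal Python sketch of the same kind of one-way ANOVA comparison. The per-question data and group accuracies here are placeholder values loosely based on the reported percentages, not the study's actual dataset.

```python
# Minimal sketch of a one-way ANOVA across responder groups,
# analogous to the comparison described above (study used Stata SE 15).
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_questions = 86  # number of SBA questions in the study

# Hypothetical binary outcomes per question (1 = correct, 0 = incorrect),
# simulated around the accuracies reported in the abstract.
chatgpt = rng.binomial(1, 0.60, n_questions)
bing = rng.binomial(1, 0.60, n_questions)
bard = rng.binomial(1, 0.62, n_questions)
humans = rng.binomial(1, 0.59, n_questions)

# One-way ANOVA across the four groups of per-question scores.
f_stat, p_value = stats.f_oneway(chatgpt, bing, bard, humans)
print(f"F = {f_stat:.3f}, p = {p_value:.3f}")  # p < 0.05 -> significant
```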

Results: The Bard LLM achieved the highest accuracy, with 62% correct answers overall and 69% when excluding IB questions, outperforming human exam participants (59%; p = 0.67 and 59%; p = 0.42, respectively). All LLMs scored highest on theory-based questions when excluding IB questions (ChatGPT: 79%; Bing: 83%; Bard: 86%), performing significantly better than the human exam participants (60%; p = 0.03). No LLM answered any IB question correctly.

Discussion and conclusion: AI passed the written EANS board exam based on representative SBA questions and achieved results close to, or even better than, those of the human exam participants. Our results raise several ethical and practical issues, which may impact the current concept of the written EANS board exam.

Keywords: Artificial intelligence; Bard; Bing; Board certification; ChatGPT; EANS; Neurosurgery board examination.