Does Google's Bard Chatbot perform better than ChatGPT on the European hand surgery exam?

Int Orthop. 2024 Jan;48(1):151-158. doi: 10.1007/s00264-023-06034-y. Epub 2023 Nov 15.

Abstract

Purpose: According to previous research, the chatbot ChatGPT® V3.5 was unable to pass the first part of the European Board of Hand Surgery (EBHS) diploma examination. This study aimed to investigate whether Google's chatbot Bard® would perform better than ChatGPT® on the EBHS diploma examination.

Methods: The chatbots were asked to answer 18 EBHS multiple-choice questions (MCQs) published in the Journal of Hand Surgery (European Volume) in five trials (A1 to A5). After A3, the chatbots were given the correct answers, and after A4, incorrect answers; their ability to modify their responses was then measured and compared. A hypothetical sketch of the per-trial scoring step follows.
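For readers who want a concrete picture of the scoring step, the sketch below tallies correct answers per trial. It is a minimal illustration only: the answer key, option set, and generated responses are hypothetical placeholders, not the authors' actual data or code.

    import random

    # Hypothetical stand-ins; the real study used the 18 EBHS MCQs and
    # Bard's recorded answers, which are not reproduced here.
    OPTIONS = "ABCDE"
    ANSWER_KEY = [random.choice(OPTIONS) for _ in range(18)]

    def score_trial(responses, key=ANSWER_KEY):
        """Return (n_correct, percent_correct) for one 18-question trial."""
        correct = sum(r == k for r, k in zip(responses, key))
        return correct, 100.0 * correct / len(key)

    # Five trials, A1..A5, mirroring the study design.
    trials = {f"A{i}": [random.choice(OPTIONS) for _ in range(18)]
              for i in range(1, 6)}
    for name, responses in trials.items():
        n, pct = score_trial(responses)
        print(f"{name}: {n}/18 correct ({pct:.1f}%)")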

Results: Bard® scored 3/18 (A1), 1/18 (A2), 4/18 (A3) and 2/18 (A4 and A5). The average percentage of correct answers was 61.1% for A1, 62.2% for A2, 64.4% for A3, 65.6% for A4, 63.3% for A5 and 63.3% for all trials combined. Agreement was moderate from A1 to A5 (kappa = 0.62; 95% CI, 0.51 to 0.73) as well as from A1 to A3 (kappa = 0.60; 95% CI, 0.47 to 0.74). The formulation of Bard®'s responses was homogeneous, but its learning capacity is still developing.
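The agreement figures above are Cohen's kappa values with 95% confidence intervals. The sketch below shows one common way to compute such a statistic; the per-question answer vectors are invented for illustration, and the bootstrap interval is a standard approach, not necessarily the method the authors used.

    import numpy as np
    from sklearn.metrics import cohen_kappa_score

    rng = np.random.default_rng(0)

    # Hypothetical answers (A-E) to 18 MCQs from two trials; the study
    # compared Bard's answers across trials A1 to A5.
    a1 = rng.choice(list("ABCDE"), size=18)
    a5 = np.where(rng.random(18) < 0.7, a1, rng.choice(list("ABCDE"), size=18))

    kappa = cohen_kappa_score(a1, a5)

    # Bootstrap 95% CI for kappa (the paper does not state how its
    # intervals were obtained, so this is one plausible choice).
    boot = []
    for _ in range(2000):
        idx = rng.integers(0, 18, size=18)
        boot.append(cohen_kappa_score(a1[idx], a5[idx]))
    lo, hi = np.percentile(boot, [2.5, 97.5])
    print(f"kappa = {kappa:.2f} (95% CI {lo:.2f} to {hi:.2f})")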

Conclusions: The main hypothesis of our study was not confirmed, since Bard® did not score significantly higher than ChatGPT® when answering the MCQs of the EBHS diploma exam. Neither ChatGPT® nor Bard®, in their current versions, can pass the first part of the EBHS diploma exam.

Keywords: Artificial intelligence; Bard; ChatGPT; Chatbot; Hand Surgery; Multiple-choice question.

MeSH terms

  • Humans
  • Search Engine*
  • Software*