Performance of ChatGPT, Bard, Claude, and Bing on the Peruvian National Licensing Medical Examination: a cross-sectional study

Betzy Clariza Torres-Zegarra; Wagner Rios-Garcia; Alvaro Micael Ñaña-Cordova; Karen Fatima Arteaga-Cisneros; Xiomara Cristina Benavente Chalco; Marina Atena Bustamante Ordoñez; Carlos Jesus Gutierrez Rios; Carlos Alberto Ramos Godoy; Kristell Luisa Teresa Panta Quezada; Jesus Daniel Gutierrez-Arratia; Javier Alejandro Flores-Cohaila

doi:10.3352/jeehp.2023.20.30

J Educ Eval Health Prof. 2023:20:30. doi: 10.3352/jeehp.2023.20.30. Epub 2023 Nov 20.

Authors

Betzy Clariza Torres-Zegarra¹, Wagner Rios-Garcia², Alvaro Micael Ñaña-Cordova¹, Karen Fatima Arteaga-Cisneros¹, Xiomara Cristina Benavente Chalco¹, Marina Atena Bustamante Ordoñez¹, Carlos Jesus Gutierrez Rios¹, Carlos Alberto Ramos Godoy³,⁴, Kristell Luisa Teresa Panta Quezada⁴, Jesus Daniel Gutierrez-Arratia⁴,⁵, Javier Alejandro Flores-Cohaila¹,⁴

Affiliations

¹ Escuela de Medicina, Universidad Cientifica del Sur, Lima, Peru.
² Sociedad Científica de Estudiantes de Medicina de Ica, Universidad Nacional San Luis Gonzaga, Ica, Peru.
³ Universidad Nacional de Cajamarca, Cajamarca, Peru.
⁴ Academic Department, USAMEDIC, Lima, Peru.
⁵ Neurogenetics Research Center, Instituto Nacional de Ciencias Neurologicas, Lima, Peru.

Abstract

Purpose: We aimed to describe the performance and evaluate the educational value of justifications provided by artificial intelligence chatbots, including GPT-3.5, GPT-4, Bard, Claude, and Bing, on the Peruvian National Medical Licensing Examination (P-NLME).

Methods: This was a cross-sectional analytical study. On July 25, 2023, each multiple-choice question (MCQ) from the P-NLME was entered into each chatbot (GPT-3.5, GPT-4, Bing, Bard, and Claude) 3 times. Then, 4 medical educators categorized the MCQs by medical area, item type, and whether the MCQ required Peru-specific knowledge. They also assessed the educational value of the justifications from the 2 top performers (GPT-4 and Bing).

Results: GPT-4 scored 86.7% and Bing scored 82.2%, followed by Bard and Claude; by comparison, the historical performance of Peruvian examinees was 55%. Among the factors associated with correct answers, only MCQs that required Peru-specific knowledge had lower odds (odds ratio, 0.23; 95% confidence interval, 0.09-0.61), whereas the remaining factors showed no associations. In assessing the educational value of the justifications provided by GPT-4 and Bing, no significant differences were found in certainty, usefulness, or potential use in the classroom.
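
As a reading aid for the reported odds ratio, the short Python sketch below checks that the point estimate (0.23) sits at the geometric center of the reported 95% confidence interval (0.09-0.61) and recovers an approximate standard error on the log-odds scale. It assumes the interval is a Wald-type interval, symmetric on the log scale, as is common for logistic regression output; the abstract does not state how the interval was computed, and the function name `log_scale_summary` is purely illustrative.

```python
import math

def log_scale_summary(ci_low, ci_high, z=1.959964):
    """Return (geometric center, log-scale standard error) of a ratio CI.

    Assumption: the interval is symmetric on the log scale (Wald-type),
    i.e. exp(log(OR) +/- z * SE). This is not confirmed by the abstract.
    """
    log_low, log_high = math.log(ci_low), math.log(ci_high)
    center = math.exp((log_low + log_high) / 2)   # implied point estimate
    se = (log_high - log_low) / (2 * z)           # implied SE of log(OR)
    return center, se

# Values reported in the abstract for Peru-specific MCQs
center, se = log_scale_summary(0.09, 0.61)
print(f"implied OR = {center:.2f}, implied SE(log OR) = {se:.2f}")
```

Under that assumption, the implied center rounds to the reported odds ratio of 0.23, so the estimate and its interval are internally consistent. Because the upper bound is below 1, the interval excludes the null value, which matches the description of lower odds of a correct answer on Peru-specific MCQs.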

Conclusion: Among chatbots, GPT-4 and Bing were the top performers, with Bing performing better at Peru-specific MCQs. Moreover, the educational value of justifications provided by GPT-4 and Bing could be deemed appropriate. However, it is essential to start addressing the educational value of these chatbots, rather than merely their performance on examinations.

Keywords: Artificial intelligence; Educational measurement; Medical education; Peru.

MeSH terms

Artificial Intelligence*
Cross-Sectional Studies
Educational Status
Humans
Knowledge*
Peru