Performance of ChatGPT incorporated chain-of-thought method in bilingual nuclear medicine physician board examinations

Yu-Ting Ting; Te-Chun Hsieh; Yuh-Feng Wang; Yu-Chieh Kuo; Yi-Jin Chen; Pak-Ki Chan; Chia-Hung Kao

doi:10.1177/20552076231224074

Digit Health. 2024 Jan 5:10:20552076231224074. doi: 10.1177/20552076231224074. eCollection 2024 Jan-Dec.

Authors

Yu-Ting Ting¹, Te-Chun Hsieh^{1,2}, Yuh-Feng Wang^{3,4,5}, Yu-Chieh Kuo⁶, Yi-Jin Chen⁶, Pak-Ki Chan⁶, Chia-Hung Kao^{1,6,7,8}

Affiliations

¹ Department of Nuclear Medicine and PET Center, China Medical University Hospital, China Medical University, Taichung.
² Department of Biomedical Imaging and Radiological Science, China Medical University, Taichung.
³ Department of Nuclear Medicine, Taipei Veterans General Hospital, Taipei.
⁴ Department of Biomedical Imaging and Radiological Sciences, National Yang Ming Chiao Tung University, Taipei.
⁵ Department of Medical Imaging and Radiological Technology, Yuanpei University of Medical Technology, Hsinchu.
⁶ Artificial Intelligence Center, China Medical University Hospital, China Medical University, Taichung.
⁷ Graduate Institute of Biomedical Sciences, School of Medicine, College of Medicine, China Medical University, Taichung.
⁸ Department of Bioinformatics and Medical Engineering, Asia University, Taichung.

Abstract

Objective: This research explores the performance of ChatGPT, compared with that of human doctors, on the bilingual (Mandarin Chinese and English) medical specialty exam in Nuclear Medicine in Taiwan.

Methods: The study employed a generative pre-trained transformer model (GPT-4) and integrated the chain-of-thought (COT) method, which is intended to enhance performance by prompting the model to lay out and explain its reasoning so that it answers each question in a coherent and logical manner. Questions from the Taiwanese Nuclear Medicine Specialty Exam served as the basis for testing. The research analyzed the correctness of AI responses in different sections of the exam and explored the influence of question length and language proportion on accuracy.
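As an illustration of the Methods only, the following is a minimal Python sketch of how a COT prompt might be assembled and how question length and English proportion might be measured. It is not the study's implementation: the instruction wording, the function names (build_cot_prompt, question_length, english_proportion), and the counting rule (each run of Latin letters is one word, each Chinese character is one word) are all assumptions, since the abstract does not specify them.

```python
import re

# Hypothetical COT instruction; the abstract does not give the actual prompt wording.
COT_INSTRUCTION = (
    "Think through the problem step by step, explain your reasoning, "
    "then state the final answer as a single option letter."
)


def build_cot_prompt(question: str, options: dict[str, str]) -> str:
    """Join the question, its options (sorted by key), and the COT instruction."""
    lines = [question.strip()]
    lines += [f"({key}) {text}" for key, text in sorted(options.items())]
    lines.append(COT_INSTRUCTION)
    return "\n".join(lines)


# Assumed counting rule: a run of Latin letters (optionally joined by a hyphen
# or apostrophe) is one English word; each CJK ideograph is one Chinese word.
ENGLISH_WORD = re.compile(r"[A-Za-z]+(?:[-'][A-Za-z]+)*")
CJK_CHAR = re.compile(r"[\u4e00-\u9fff]")


def question_length(text: str) -> int:
    """Total word count under the assumed counting rule."""
    return len(ENGLISH_WORD.findall(text)) + len(CJK_CHAR.findall(text))


def english_proportion(text: str) -> float:
    """Share of English words among all counted words (0.0 for empty input)."""
    english = len(ENGLISH_WORD.findall(text))
    total = english + len(CJK_CHAR.findall(text))
    return english / total if total else 0.0
```

Under these assumptions, a mixed question such as "FDG PET 的適應症" counts as two English words and four Chinese characters. A different tokenization, for example segmenting Chinese into multi-character words, would change both measures.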

Results: AI, especially ChatGPT with COT, exhibited exceptional capabilities in theoretical knowledge, clinical medicine, and handling integrated questions, often surpassing or matching human doctor performance. However, AI struggled with questions related to medical regulations. The analysis of question length showed that questions in the 109-163 word range yielded the highest accuracy. Moreover, an increase in the proportion of English words in questions improved both AI and human accuracy.

Conclusions: This research highlights the potential and challenges of AI in the medical field. ChatGPT demonstrates significant competence in various aspects of medical knowledge. However, areas such as medical regulations require improvement. The study also suggests that AI may help in evaluating exam question difficulty and maintaining fairness in examinations. These findings shed light on AI's role in the medical field, with potential applications in healthcare education, exam preparation, and multilingual environments. Ongoing AI advancements are expected to further enhance AI's utility in the medical domain.

Keywords: ChatGPT; chain-of-thoughts (COT); multilingual environment; nuclear medicine exam.