Performance of ChatGPT on Chinese Master's Degree Entrance Examination in Clinical Medicine

Ke-Cheng Li; Zhi-Jun Bu; Md Shahjalal; Bai-Xiang He; Zi-Fan Zhuang; Chen Li; Jian-Ping Liu; Bin Wang; Zhao-Lan Liu

doi:10.1371/journal.pone.0301702

PLoS One. 2024 Apr 4;19(4):e0301702. doi: 10.1371/journal.pone.0301702. eCollection 2024.

Authors

Ke-Cheng Li¹, Zhi-Jun Bu², Md Shahjalal³, Bai-Xiang He⁴, Zi-Fan Zhuang⁵, Chen Li⁶, Jian-Ping Liu², Bin Wang¹, Zhao-Lan Liu²

Affiliations

¹ Department of Andrology, Dongzhimen Hospital, Beijing University of Chinese Medicine, Beijing, China.
² Centre for Evidence-Based Chinese Medicine, Beijing University of Chinese Medicine, Beijing, China.
³ Department of Public Health, North South University, Dhaka, Bangladesh.
⁴ Department of Gastroenterology, Dongzhimen Hospital, Beijing University of Chinese Medicine, Beijing, China.
⁵ Department of Endocrinology, Guang'anmen Hospital, China Academy of Chinese Medical Sciences, Beijing, China.
⁶ Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University, Xi'an, Shaanxi, China.

Abstract

Background: ChatGPT is a large language model designed to generate responses based on a contextual understanding of user queries and requests. This study utilised the entrance examination for the Master of Clinical Medicine in Traditional Chinese Medicine to assess the reliability and practicality of ChatGPT within the domain of medical education.

Methods: We selected 330 single-choice and multiple-choice questions, none of which included images or tables, from the 2021 and 2022 Chinese Master of Clinical Medicine comprehensive examinations. To ensure the test's accuracy and authenticity, we preserved the original format of the question stems and answer options, without any modifications or explanations.

Results: Both ChatGPT3.5 and GPT-4 attained average scores surpassing the admission threshold. ChatGPT achieved its highest score in the Medical Humanities section, with a correct rate of 93.75%. However, ChatGPT3.5 had its lowest accuracy, 37.5%, in the Pathology section, and GPT-4 had a relatively low accuracy of 60.23% in the Biochemistry section. An analysis by question type showed that ChatGPT performed well on single-choice questions but poorly on multiple-choice questions.

Conclusion: ChatGPT exhibits a degree of medical knowledge and the capacity to aid in diagnosing and treating diseases. Nevertheless, improvements are needed to address its limitations in accuracy and reliability. Its use must be accompanied by rigorous evaluation and oversight, together with proactive measures to overcome its current constraints.

Copyright: © 2024 Li et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

MeSH terms

Artificial Intelligence*
Clinical Medicine*
Educational Measurement*
Language
Reproducibility of Results

Grants and funding

This study is supported by a grant from the National Natural Science Foundation of China (Grant No. 82374298) and the Reserve Discipline Leader Funding of Beijing University of Chinese Medicine (Grant No. 90010960920033). There was no additional external funding received for this study.