Evaluating large language models on a highly-specialized topic, radiation oncology physics

Jason Holmes; Zhengliang Liu; Lian Zhang; Yuzhen Ding; Terence T Sio; Lisa A McGee; Jonathan B Ashman; Xiang Li; Tianming Liu; Jiajian Shen; Wei Liu

doi:10.3389/fonc.2023.1219326

Front Oncol. 2023 Jul 17:13:1219326. doi: 10.3389/fonc.2023.1219326. eCollection 2023.

Affiliations

¹ Department of Radiation Oncology, Mayo Clinic, Phoenix, AZ, United States.
² School of Computing, The University of Georgia, Athens, GA, United States.
³ Department of Radiology, Massachusetts General Hospital, Boston, MA, United States.

Abstract

Purpose: We present the first study to investigate Large Language Models (LLMs) in answering radiation oncology physics questions. Because popular exams like AP Physics, LSAT, and GRE have large test-taker populations and ample test preparation resources in circulation, they may not allow for accurately assessing the true potential of LLMs. This paper proposes evaluating LLMs on a highly-specialized topic, radiation oncology physics, which may be more pertinent to scientific and medical communities in addition to being a valuable benchmark of LLMs.

Methods: We developed an exam consisting of 100 radiation oncology physics questions based on our expertise. Four LLMs, ChatGPT (GPT-3.5), ChatGPT (GPT-4), Bard (LaMDA), and BLOOMZ, were evaluated against medical physicists and non-experts. The performance of ChatGPT (GPT-4) was further explored by prompting it to explain first, then answer. The deductive reasoning capability of ChatGPT (GPT-4) was evaluated using a novel approach (substituting the correct answer with "None of the above choices is the correct answer."). A majority vote analysis was used to approximate how well each group could score when working together.
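The answer-substitution step described in the Methods can be illustrated with a minimal sketch. The abstract does not specify how the exam questions were stored, so the dictionary layout, the function name `substitute_correct_answer`, and the placeholder option text below are all assumptions made for illustration only.

```python
# Replacement text quoted from the Methods paragraph.
NONE_OF_THE_ABOVE = "None of the above choices is the correct answer."


def substitute_correct_answer(choices, correct_label):
    """Return a copy of `choices` with the correct option's text replaced.

    `choices` maps a label (e.g. "A") to the option text. This layout is
    hypothetical; the source does not describe its data format.
    """
    if correct_label not in choices:
        raise KeyError(f"unknown answer label: {correct_label!r}")
    modified = dict(choices)  # leave the original question untouched
    modified[correct_label] = NONE_OF_THE_ABOVE
    return modified


# Placeholder question (not taken from the actual exam).
question = {
    "A": "first option",
    "B": "second option",
    "C": "third option",
    "D": "fourth option",
}
modified_question = substitute_correct_answer(question, "B")
```

Under this setup the correct label is unchanged, but the model can no longer match the question to a familiar answer and has to rule out each remaining option.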

Results: ChatGPT (GPT-4) outperformed all other LLMs and medical physicists, on average, with improved accuracy when prompted to explain before answering. ChatGPT (GPT-3.5 and GPT-4) showed a high level of consistency in its answer choices across a number of trials, whether correct or incorrect, a characteristic that was not observed in the human test groups or Bard (LaMDA). In evaluating deductive reasoning ability, ChatGPT (GPT-4) demonstrated surprising accuracy, suggesting the potential presence of an emergent ability. Finally, although ChatGPT (GPT-4) performed well overall, its intrinsic properties did not allow for further improvement when scoring based on a majority vote across trials. In contrast, a team of medical physicists was able to greatly outperform ChatGPT (GPT-4) using a majority vote.
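A minimal sketch of majority-vote scoring shows why high answer consistency limits the benefit of voting. The function name, the data layout, the tie-breaking rule (first answer encountered wins), and the toy answers are assumptions for illustration; they are not the study's data or its exact procedure.

```python
from collections import Counter


def majority_vote_score(trials, answer_key):
    """Fraction of questions where the most common answer matches the key.

    `trials` is a list of answer lists, one per trial or test-taker.
    Ties go to the answer seen first (an assumption; the source does not
    state how ties were handled).
    """
    correct = 0
    for q_index, key in enumerate(answer_key):
        votes = Counter(trial[q_index] for trial in trials)
        top_choice, _ = votes.most_common(1)[0]
        if top_choice == key:
            correct += 1
    return correct / len(answer_key)


# Toy data only: a four-question key and three trials per group.
answer_key = ["A", "C", "B", "D"]

# Identical answers in every trial: voting cannot fix the repeated error.
consistent_trials = [["A", "C", "D", "D"]] * 3

# Errors fall on different questions, so the vote cancels them out.
varied_trials = [
    ["A", "B", "B", "D"],
    ["A", "C", "D", "D"],
    ["B", "C", "B", "A"],
]

consistent_score = majority_vote_score(consistent_trials, answer_key)
varied_score = majority_vote_score(varied_trials, answer_key)
```

In this toy case the consistent group scores 0.75 both individually and by vote, while the varied group, whose members score 0.75, 0.75, and 0.5 individually, reaches 1.0 by vote.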

Conclusion: This study suggests great potential for LLMs to work alongside radiation oncology experts as highly knowledgeable assistants.

Keywords: ChatGPT; artificial intelligence; large language model; medical physics; natural language processing.

Grants and funding

K25 CA168984/CA/NCI NIH HHS/United States