Evaluating the strengths and weaknesses of large language models in answering neurophysiology questions

Hassan Shojaee-Mend; Reza Mohebbati; Mostafa Amiri; Alireza Atarodi

doi:10.1038/s41598-024-60405-y

Sci Rep. 2024 May 11;14(1):10785. doi: 10.1038/s41598-024-60405-y.

Authors

Hassan Shojaee-Mend¹, Reza Mohebbati², Mostafa Amiri¹ ³, Alireza Atarodi⁴

Affiliations

¹ Department of General Courses, Faculty of Medicine, Gonabad University of Medical Sciences, Gonabad, Iran.
² Department of Physiology, Faculty of Medicine, Gonabad University of Medical Sciences, Gonabad, Iran.
³ Department of English Language and General Courses, School of Medicine, Mashhad University of Medical Sciences, Mashhad, Iran.
⁴ Department of Knowledge and Information Science, Paramedical College and Social Development & Health Promotion Research Center, Gonabad University of Medical Sciences, Gonabad, Iran. aratarodi1387@yahoo.com.

Abstract

Large language models (LLMs), such as ChatGPT, Google's Bard, and Anthropic's Claude, show remarkable natural language processing capabilities. Evaluating their proficiency in specialized domains such as neurophysiology is crucial to understanding their utility in research, education, and clinical applications. This study aims to assess and compare the effectiveness of LLMs in answering neurophysiology questions in both English and Persian (Farsi) across a range of topics and cognitive levels. Twenty questions covering four topics (general, sensory system, motor system, and integrative) and two cognitive levels (lower-order and higher-order) were posed to the LLMs. Physiologists scored the essay-style answers on a scale of 0-5 points. Statistical analysis compared the scores across model, language, topic, and cognitive level. Qualitative analysis identified reasoning gaps. In general, the models demonstrated good performance (mean score = 3.87/5), with no significant difference between languages or cognitive levels. Performance was strongest in the motor system (mean = 4.41) and weakest in integrative topics (mean = 3.35). Detailed qualitative analysis uncovered deficiencies in reasoning, discerning priorities, and knowledge integration. This study offers valuable insights into LLMs' capabilities and limitations in the field of neurophysiology. The models demonstrate proficiency in general questions but face challenges in advanced reasoning and knowledge integration. Targeted training could address gaps in knowledge and causal reasoning. As LLMs evolve, rigorous domain-specific assessments will be crucial for evaluating advancements in their performance.
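
The grouped comparison described in the abstract (scores broken down by model, language, topic, and cognitive level) can be illustrated with a minimal Python sketch. It computes only descriptive group means; the abstract does not name the statistical tests used, so none are reproduced here. The record layout, the model labels, the helper `mean_score_by`, and all score values are hypothetical placeholders, not data from the study.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (model, language, topic, cognitive_level, score 0-5).
# These values are illustrative placeholders, NOT data from the study.
records = [
    ("model_a", "English", "motor system", "lower-order", 5),
    ("model_a", "Persian", "motor system", "higher-order", 4),
    ("model_b", "English", "integrative", "higher-order", 3),
    ("model_b", "Persian", "integrative", "lower-order", 4),
    ("model_a", "English", "general", "lower-order", 4),
    ("model_b", "Persian", "sensory system", "higher-order", 3),
]

# Names of the grouping fields, in the same order as the tuple positions.
FIELDS = ("model", "language", "topic", "cognitive_level")


def mean_score_by(records, field):
    """Return {group value: mean score} for one grouping field."""
    idx = FIELDS.index(field)
    groups = defaultdict(list)
    for rec in records:
        groups[rec[idx]].append(rec[-1])  # score is the last tuple element
    return {group: mean(scores) for group, scores in groups.items()}


by_topic = mean_score_by(records, "topic")
by_language = mean_score_by(records, "language")
```

Calling `mean_score_by(records, "cognitive_level")` or `mean_score_by(records, "model")` gives the other two breakdowns in the same way.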

Keywords: Bloom’s taxonomy; Evaluation; Large language models; Neurophysiology.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Cognition / physiology
Humans
Language*
Natural Language Processing
Neurophysiology* / methods