Capability of GPT-4V(ision) in the Japanese National Medical Licensing Examination: Evaluation Study

JMIR Med Educ. 2024 Mar 12:10:e54393. doi: 10.2196/54393.

Abstract

Background: Previous research applying large language models (LLMs) to medicine has focused on text-based information. Recently, multimodal variants of LLMs have acquired the capability to recognize images.

Objective: We aim to evaluate the image recognition capability of generative pretrained transformer (GPT)-4V, a recent multimodal LLM developed by OpenAI, in the medical field by testing how visual information affects its performance in answering questions from the 117th Japanese National Medical Licensing Examination.

Methods: We focused on 108 questions that had 1 or more images as part of the question and presented GPT-4V with the same questions under 2 conditions: (1) with both the question text and the associated images and (2) with the question text only. We then compared the difference in accuracy between the 2 conditions using the exact McNemar test.

Results: Among the 108 questions with images, GPT-4V's accuracy was 68% (73/108) when presented with images and 72% (78/108) when presented without images (P=.36). For the 2 question categories, clinical and general, the accuracies with and without images were 71% (70/98) versus 78% (76/98; P=.21) and 30% (3/10) versus 20% (2/10; P≥.99), respectively.
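The paired comparison described in the Methods can be illustrated with a short sketch using the exact McNemar test as implemented in the Python package statsmodels. The 2x2 table below is hypothetical: only its row and column totals (73/108 correct with images, 78/108 correct without images) come from the reported results, while the split into concordant and discordant pairs is an assumed placeholder, not the study's actual data.

    # Minimal sketch of the paired comparison, using the exact McNemar test.
    # NOTE: the concordant/discordant split below is hypothetical; only the
    # marginal totals (73/108 with images, 78/108 without) match the abstract.
    from statsmodels.stats.contingency_tables import mcnemar

    # Rows: GPT-4V correct / incorrect WITH images
    # Columns: GPT-4V correct / incorrect WITHOUT images
    table = [
        [65,  8],   # hypothetical split; row total 73 = correct with images
        [13, 22],   # hypothetical split; row total 35 = incorrect with images
    ]

    result = mcnemar(table, exact=True)  # exact binomial test on the discordant pairs
    print(f"P value = {result.pvalue:.2f}")

The exact test conditions on the discordant pairs only (questions answered correctly under one condition but not the other), which is why the reported P values depend on more than the marginal accuracies shown in the abstract.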

Conclusions: The additional information from the images did not significantly improve the performance of GPT-4V in the Japanese National Medical Licensing Examination.

Keywords: AI; ChatGPT; GPT-4; GPT-4V; LLM; NLP; answer; answers; artificial intelligence; chatbot; chatbots; conversational agent; conversational agents; exam; examination; examinations; exams; generative pretrained transformer; image; images; imaging; language model; language models; large language model; medical education; natural language processing; response; responses.

MeSH terms

  • Japan
  • Language
  • Licensure*
  • Medicine*