Keeping Up With ChatGPT: Evaluating Its Recognition and Interpretation of Nuclear Medicine Images

Clin Nucl Med. 2024 Jun 1;49(6):500-504. doi: 10.1097/RLU.0000000000005207. Epub 2024 Apr 1.

Abstract

Purpose: The latest iteration of GPT-4 (generative pretrained transformer) is a large multimodal model that can integrate both text and image input, but its performance with medical images has not been systematically evaluated. We studied whether ChatGPT with GPT-4V(ision) can recognize images from common nuclear medicine examinations and interpret them.

Patients and methods: Fifteen representative images (scintigraphy, 11; PET, 4) were submitted to ChatGPT with GPT-4V(ision), in both its Default and "Advanced Data Analysis (beta)" versions. ChatGPT was asked to name the type of examination and tracer, describe the findings, and state whether there were abnormalities. ChatGPT was also asked to mark anatomical structures or pathological findings. The appropriateness of the responses was rated by 3 nuclear medicine physicians.
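For readers who wish to probe the same behavior programmatically, the following is a minimal sketch of an equivalent image-plus-prompt query via the OpenAI Python API. Note the assumptions: the study itself used the ChatGPT web interface (Default and "Advanced Data Analysis (beta)"), not the API; the model name gpt-4o and the exact prompt wording here are illustrative, paraphrased from the methods above.

    # Hypothetical reproduction sketch, NOT the authors' protocol: the study
    # queried the ChatGPT web interface; this sends an equivalent
    # image-plus-text prompt through the OpenAI Python API.
    import base64
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def query_image(path: str) -> str:
        # Encode the scan as base64 so it can be sent inline as a data URL.
        b64 = base64.b64encode(open(path, "rb").read()).decode()
        prompt = ("Name the type of nuclear medicine examination and the "
                  "tracer, explain the findings, and state whether there "
                  "are abnormalities.")  # wording paraphrased from the abstract
        resp = client.chat.completions.create(
            model="gpt-4o",  # assumed vision-capable model; the study used GPT-4V(ision)
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{b64}"}},
                ],
            }],
        )
        return resp.choices[0].message.content

    # Repeating the same prompt 3 times per image, as the study did for its
    # 15 images, allows the consistency of the responses to be assessed.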

Results: The Default version identified the examination and the tracer correctly in the majority of the 15 cases (60% and 53%, respectively) and gave an "appropriate" description of the findings and abnormalities in 47% and 33% of cases, respectively. The Default version could not manipulate images. "Advanced Data Analysis (beta)" failed in all tasks in >90% of cases. A "major" or "incompatible" inconsistency across 3 trials of the same prompt was observed in 73% of cases (Default version) and 87% of cases ("Advanced Data Analysis (beta)" version).

Conclusions: Although GPT-4V(ision) demonstrates preliminary capabilities in analyzing nuclear medicine images, it exhibits significant limitations, particularly in its reliability (ie, correctness, predictability, and consistency).

MeSH terms

  • Humans
  • Image Interpretation, Computer-Assisted / methods
  • Nuclear Medicine*