Translating musculoskeletal radiology reports into patient-friendly summaries using ChatGPT-4

Ian J Kuckelman; Karla Wetley; Paul Hyunsoo Yi; Andrew Bailey Ross

doi:10.1007/s00256-024-04599-2

Skeletal Radiol. 2024 Jan 25. doi: 10.1007/s00256-024-04599-2. Online ahead of print.

Authors

Ian J Kuckelman¹, Karla Wetley², Paul Hyunsoo Yi³, Andrew Bailey Ross²

Affiliations

¹ University of Wisconsin School of Medicine & Public Health, 750 Highland Ave, Madison, WI, 53705, USA. kuckelman@wisc.edu.
² University of Wisconsin School of Medicine & Public Health, 750 Highland Ave, Madison, WI, 53705, USA.
³ University of Maryland School of Medicine, 655 W Baltimore St S, Baltimore, MD, 21201, USA.

PMID: 38270616
DOI: 10.1007/s00256-024-04599-2

Abstract

Objective: To assess the feasibility of using large language models (LLMs), specifically ChatGPT-4, to generate concise and accurate layperson summaries of musculoskeletal radiology reports.

Methods: Sixty radiology reports (20 shoulder MR, 20 knee MR, and 20 lumbar spine MR) were obtained from PACS. The reports were deidentified and then submitted to ChatGPT-4 with the prompt "Produce an organized and concise layperson summary of the findings of the following radiology report. Target a reading level of 8-9th grade and word count <300 words." Three independent readers (two primary readers and a third added later for validation) evaluated the summaries for completeness and accuracy against the original reports. Summaries were rated on a scale of 1 to 3: 1) summaries that were incorrect or incomplete, potentially providing harmful or confusing information; 2) summaries that were mostly correct and complete, unlikely to cause confusion or harm; and 3) summaries that were entirely correct and complete.
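As a minimal sketch of the submission step described above, the following Python pairs the quoted prompt with a deidentified report and checks a returned summary against the word limit. The function names `build_request` and `within_word_limit` are hypothetical, and the abstract does not say how reports were submitted or how word count was measured, so whitespace-delimited counting is an assumption.

```python
# The prompt text is quoted verbatim from the Methods section.
PROMPT = (
    "Produce an organized and concise layperson summary of the findings "
    "of the following radiology report. Target a reading level of 8-9th "
    "grade and word count <300 words."
)


def build_request(report_text: str) -> str:
    """Join the fixed prompt and one deidentified report (hypothetical helper)."""
    return f"{PROMPT}\n\n{report_text.strip()}"


def within_word_limit(summary: str, limit: int = 300) -> bool:
    """Return True if the summary has fewer than `limit` words.

    Assumption: words are counted by splitting on whitespace.
    """
    return len(summary.split()) < limit
```

Readability is not checked here because the abstract does not name the reading-level metric that was used.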

Results: All 60 responses met the criteria for word count and readability. Mean ratings for accuracy were 2.58 for reader 1, 2.71 for reader 2, and 2.77 for reader 3. Mean ratings for completeness were 2.87 for reader 1, 2.73 for reader 2, and 2.87 for reader 3. For accuracy, reader 1 rated three summaries as 1, reader 2 rated one, and reader 3 rated none. For the two primary readers, inter-reader agreement was low for accuracy (kappa 0.33) and completeness (kappa 0.29). There were no statistically significant changes in inter-reader agreement when the third reader's ratings were included in the analysis.
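To illustrate how a mean rating and a two-reader kappa of this kind can be computed, the sketch below implements unweighted Cohen's kappa on the 1-to-3 scale. The abstract does not state which kappa variant was used, so the unweighted form is an assumption, and the ten ratings per reader are invented for illustration only; they are not study data.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("rating lists must be non-empty and equal in length")
    n = len(ratings_a)
    # Observed agreement: fraction of items given the same score.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement from each rater's marginal score frequencies.
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    categories = set(ratings_a) | set(ratings_b)
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical score throughout
    return (observed - expected) / (1 - expected)


# Hypothetical ratings (NOT from the study): 10 summaries, scale 1 to 3.
reader_1 = [3, 3, 2, 3, 1, 2, 3, 3, 2, 3]
reader_2 = [3, 2, 2, 3, 2, 3, 3, 3, 3, 3]

mean_1 = sum(reader_1) / len(reader_1)
mean_2 = sum(reader_2) / len(reader_2)
kappa = cohen_kappa(reader_1, reader_2)
print(f"mean reader 1: {mean_1:.2f}, mean reader 2: {mean_2:.2f}")
print(f"kappa: {kappa:.2f}")
```

In this invented example the readers agree on 6 of 10 items, yet kappa is low because both readers assign a 3 most of the time, which makes chance agreement high. The same effect can produce high mean ratings alongside low kappa values.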

Conclusion: Overall ratings for accuracy and completeness of the AI-generated layperson report summaries were high, with only a small minority likely to be confusing or inaccurate. This study illustrates the potential for leveraging generative AI, such as ChatGPT-4, to automate the production of patient-friendly summaries for musculoskeletal MR imaging.

Keywords: Artificial intelligence; ChatGPT; Large language models; Musculoskeletal; Patient education; Report.