ChatGPT4 outperforms endoscopists for determination of post-colonoscopy re-screening and surveillance recommendations

Clin Gastroenterol Hepatol. 2024 May 8:S1542-3565(24)00429-4. doi: 10.1016/j.cgh.2024.04.022. Online ahead of print.

Abstract

Background: Large language models (LLMs), including ChatGPT4, improve access to artificial intelligence, but their impact on the clinical practice of gastroenterology is undefined. In this study, we aim to compare the accuracy, concordance, and reliability of ChatGPT4 colonoscopy recommendations for colorectal cancer re-screening and surveillance with contemporary guidelines and real-world gastroenterology practice.

Methods: History of present illness, colonoscopy data, and pathology reports from patients undergoing procedures at two large academic centers were entered into ChatGPT4, which was then queried for the next recommended colonoscopy follow-up interval. Using McNemar's test and inter-rater reliability, we compared the recommendations made by ChatGPT4 with the actual surveillance interval provided in the endoscopist's procedure report (gastroenterology practice) and the appropriate USMSTF guidance. The latter was generated for each case by an expert panel using the clinical information and guideline documents as reference.
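
The paired comparison described in the Methods can be illustrated with a minimal sketch. This is not the authors' analysis code: it assumes the exact (binomial) form of McNemar's test and the standard Fleiss κ formula, and the function names `mcnemar_exact` and `fleiss_kappa`, as well as every interval value in the toy lists, are hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the two discordant-pair counts.

    b: cases where rater A matches the reference and rater B does not
    c: cases where rater B matches the reference and rater A does not
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def fleiss_kappa(ratings):
    """Fleiss kappa for a list of cases, each a list of labels (one per rater)."""
    n_cases = len(ratings)
    n_raters = len(ratings[0])
    categories = sorted({label for case in ratings for label in case})
    counts = [Counter(case) for case in ratings]
    # Observed agreement, averaged over cases
    p_i = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_bar = sum(p_i) / n_cases
    # Agreement expected by chance from the overall category proportions
    p_j = [sum(c[cat] for c in counts) / (n_cases * n_raters) for cat in categories]
    p_e = sum(p ** 2 for p in p_j)
    if p_e == 1.0:
        return 1.0  # only one category was ever used
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical follow-up intervals in years (toy data, NOT from the study)
panel = [10, 10, 7, 5, 3, 3, 10, 1]      # expert-panel reference
chatgpt = [10, 10, 7, 5, 3, 5, 10, 1]    # LLM recommendation
practice = [10, 5, 5, 5, 3, 3, 5, 1]     # endoscopist's report

llm_ok = [g == p for g, p in zip(chatgpt, panel)]
practice_ok = [e == p for e, p in zip(practice, panel)]

# Discordant pairs: exactly one of the two sources agrees with the panel
b = sum(l and not e for l, e in zip(llm_ok, practice_ok))
c = sum(e and not l for l, e in zip(llm_ok, practice_ok))

p_value = mcnemar_exact(b, c)
kappa = fleiss_kappa([list(pair) for pair in zip(chatgpt, panel)])
print(f"discordant pairs b={b}, c={c}, McNemar exact p={p_value:.3f}")
print(f"Fleiss kappa, LLM vs panel: {kappa:.3f}")
```

McNemar's test uses only the discordant pairs because each case is rated by both sources; cases where both agree or both disagree with the panel carry no information about which source is closer to the guideline.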

Results: Text input of de-identified data into ChatGPT4 from 505 consecutive patients undergoing colonoscopy between January 1 and April 30, 2023 elicited a successful follow-up recommendation in 99.2% of the queries. ChatGPT4 recommendations were in closer agreement with the USMSTF Panel (85.7%) than gastroenterology practice recommendations with the USMSTF Panel (75.4%) (P<.001). Of the 14.3% discordant recommendations between ChatGPT4 and USMSTF Panel, recommendations were for later screening in 26 (5.1%) and earlier screening in 44 (8.7%) cases. The inter-rater reliability was good for ChatGPT4 vs. USMSTF Panel (Fleiss κ: 0.786, 95% CI: 0.734-0.838, P<.001).

Conclusions: Initial real-world results suggest that ChatGPT4 can accurately define routine colonoscopy screening intervals based on verbatim input of clinical data. LLMs have potential for clinical applications, but further training is needed for broad use.

Keywords: Artificial Intelligence; ChatGPT4; Colorectal Neoplasms; Large Language Model.