Assessing the medical reasoning skills of GPT-4 in complex ophthalmology cases

Daniel Milad; Fares Antaki; Jason Milad; Andrew Farah; Thomas Khairy; David Mikhail; Charles-Édouard Giguère; Samir Touma; Allison Bernstein; Andrei-Alexandru Szigiato; Taylor Nayman; Guillaume A Mullie; Renaud Duval

doi:10.1136/bjo-2023-325053

Br J Ophthalmol. 2024 Feb 16:bjo-2023-325053. doi: 10.1136/bjo-2023-325053. Online ahead of print.

Authors

Daniel Milad¹,², Fares Antaki¹,³,⁴, Jason Milad⁵, Andrew Farah⁶, Thomas Khairy⁶, David Mikhail⁷, Charles-Édouard Giguère⁸, Samir Touma¹,², Allison Bernstein¹,², Andrei-Alexandru Szigiato¹,⁹, Taylor Nayman¹,², Guillaume A Mullie¹,¹⁰, Renaud Duval¹¹,²

Affiliations

¹ Department of Ophthalmology, University of Montreal, Montreal, Quebec, Canada.
² Department of Ophthalmology, Hôpital Maisonneuve-Rosemont, Montreal, Quebec, Canada.
³ Institute of Ophthalmology, University College London, London, UK.
⁴ CHUM School of Artificial Intelligence in Healthcare (SAIH), Centre Hospitalier de l'Université de Montréal (CHUM), Montreal, Quebec, Canada.
⁵ Department of Software Engineering, University of Waterloo, Waterloo, Ontario, Canada.
⁶ Faculty of Medicine, McGill University, Montreal, Quebec, Canada.
⁷ Faculty of Medicine, University of Toronto, Toronto, Ontario, Canada.
⁸ Centre de recherche de l'Institut universitaire en santé mentale de Montréal, Montréal, Quebec, Canada.
⁹ Department of Ophthalmology, Hôpital du Sacré-Coeur de Montréal, Montreal, Quebec, Canada.
¹⁰ Department of Ophthalmology, Cité-de-la-Santé Hospital, Laval, Quebec, Canada.
¹¹ Department of Ophthalmology, University of Montreal, Montreal, Quebec, Canada; renaud.duval@gmail.com.

PMID: 38365427
DOI: 10.1136/bjo-2023-325053

Abstract

Background/aims: This study assesses the proficiency of Generative Pre-trained Transformer (GPT)-4 in answering questions about complex clinical ophthalmology cases.

Methods: We tested GPT-4 on 422 Journal of the American Medical Association Ophthalmology Clinical Challenges, prompting the model to determine the diagnosis (open-ended question) and to identify the next step (multiple-choice question). We generated responses using two zero-shot prompting strategies, including zero-shot plan-and-solve+ (PS+), to improve the reasoning of the model. We compared the best-performing model to human graders in a benchmarking effort.
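
As an illustration of the two question formats described in the Methods, the following is a minimal sketch of how zero-shot prompts with a plan-and-solve style trigger might be assembled. The trigger sentence, the function names and the prompt layout are all assumptions made for illustration; the abstract does not give the prompts actually used in the study.

```python
from typing import Sequence

# Hypothetical trigger sentence in the spirit of plan-and-solve prompting.
# This wording is an assumption, NOT the prompt used in the study.
PLAN_AND_SOLVE_TRIGGER = (
    "Let's first understand the case, extract the relevant findings, "
    "and devise a plan. Then let's carry out the plan step by step "
    "and state the final answer."
)


def build_diagnosis_prompt(case_text: str) -> str:
    """Open-ended diagnosis question followed by the reasoning trigger."""
    return (
        f"Case:\n{case_text}\n\n"
        "Question: What is the most likely diagnosis?\n\n"
        f"{PLAN_AND_SOLVE_TRIGGER}"
    )


def build_next_step_prompt(case_text: str, options: Sequence[str]) -> str:
    """Multiple-choice next-step question with lettered options (A, B, C, ...)."""
    lettered = "\n".join(
        f"{chr(ord('A') + i)}. {option}" for i, option in enumerate(options)
    )
    return (
        f"Case:\n{case_text}\n\n"
        "Question: What would you do next?\n"
        f"{lettered}\n\n"
        f"{PLAN_AND_SOLVE_TRIGGER}"
    )


if __name__ == "__main__":
    # Placeholder case text and options, invented purely for the demo.
    demo_case = "A patient presents with sudden painless vision loss in one eye."
    demo_options = ["Observe", "Order imaging", "Start treatment", "Refer"]
    print(build_diagnosis_prompt(demo_case))
    print()
    print(build_next_step_prompt(demo_case, demo_options))
```

Both builders are zero-shot in the sense that no worked examples are included in the prompt; only the case, the question and the trigger sentence are sent to the model.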

Results: Using PS+ prompting, GPT-4 achieved mean accuracies of 48.0% (95% CI 43.1% to 52.9%) and 63.0% (95% CI 58.2% to 67.6%) in diagnosis and next step, respectively. Next-step accuracy did not significantly differ by subspecialty (p=0.44). However, diagnostic accuracy in pathology and tumours was significantly higher than in uveitis (p=0.027). When the diagnosis was accurate, 75.2% (95% CI 68.6% to 80.9%) of the next steps were correct. Conversely, when the diagnosis was incorrect, 50.2% (95% CI 43.8% to 56.6%) of the next steps were accurate. The next step was three times more likely to be accurate when the initial diagnosis was correct (p<0.001). No significant differences were observed in diagnostic accuracy and decision-making between board-certified ophthalmologists and GPT-4. Among trainees, senior residents outperformed GPT-4 in diagnostic accuracy (p≤0.001 and 0.049) and in accuracy of next step (p=0.002 and 0.020).
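
The abstract does not say which statistic underlies "three times more likely". One reading consistent with the two reported percentages (75.2% and 50.2%) is an odds ratio rather than a ratio of probabilities. The sketch below computes both from the rounded figures; the function and variable names are assumptions, and this is not the authors' analysis.

```python
def odds(probability: float) -> float:
    """Convert a probability strictly between 0 and 1 into odds."""
    if not 0.0 < probability < 1.0:
        raise ValueError("probability must be strictly between 0 and 1")
    return probability / (1.0 - probability)


def odds_ratio(p_exposed: float, p_unexposed: float) -> float:
    """Ratio of the odds in one group to the odds in the other."""
    return odds(p_exposed) / odds(p_unexposed)


# Rounded proportions taken from the Results paragraph.
p_next_step_given_correct_dx = 0.752
p_next_step_given_incorrect_dx = 0.502

# Odds ratio: about 3.0, in line with "three times".
next_step_odds_ratio = odds_ratio(
    p_next_step_given_correct_dx, p_next_step_given_incorrect_dx
)

# Ratio of the probabilities themselves: about 1.5.
next_step_risk_ratio = (
    p_next_step_given_correct_dx / p_next_step_given_incorrect_dx
)

if __name__ == "__main__":
    print(f"odds ratio: {next_step_odds_ratio:.2f}")
    print(f"risk ratio: {next_step_risk_ratio:.2f}")
```

Under this reading, the odds of a correct next step are roughly three times higher after a correct diagnosis, while the probability itself is about 1.5 times higher.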

Conclusion: Improved prompting enhances GPT-4's performance in complex clinical situations, although it does not surpass ophthalmology trainees in our context. Specialised large language models hold promise for future assistance in medical decision-making and diagnosis.