Capabilities of GPT-4 in ophthalmology: an analysis of model entropy and progress towards human-level medical question answering

Fares Antaki; Daniel Milad; Mark A Chia; Charles-Édouard Giguère; Samir Touma; Jonathan El-Khoury; Pearse A Keane; Renaud Duval

doi:10.1136/bjo-2023-324438

Br J Ophthalmol. 2023 Nov 3:bjo-2023-324438. doi: 10.1136/bjo-2023-324438. Online ahead of print.

Authors

Fares Antaki¹ ² ³ ⁴ ⁵, Daniel Milad⁴ ⁵ ⁶, Mark A Chia¹ ², Charles-Édouard Giguère⁷, Samir Touma⁴ ⁵ ⁶, Jonathan El-Khoury⁴ ⁵ ⁶, Pearse A Keane⁸ ² ⁹, Renaud Duval¹⁰ ⁶

Affiliations

¹ Moorfields Eye Hospital NHS Foundation Trust, London, UK.
² Institute of Ophthalmology, UCL, London, UK.
³ The CHUM School of Artificial Intelligence in Healthcare, Montreal, Quebec, Canada.
⁴ Department of Ophthalmology, University of Montreal, Montreal, Quebec, Canada.
⁵ Department of Ophthalmology, Centre Hospitalier de l'Universite de Montreal (CHUM), Montreal, Quebec, Canada.
⁶ Department of Ophthalmology, Hopital Maisonneuve-Rosemont, Montreal, Quebec, Canada.
⁷ Institut universitaire en santé mentale de Montréal (IUSMM), Montreal, Quebec, Canada.
⁸ Moorfields Eye Hospital NHS Foundation Trust, London, UK. renaud.duval@gmail.com, p.keane@ucl.ac.uk
⁹ NIHR Moorfields Biomedical Research Centre, London, UK.
¹⁰ Department of Ophthalmology, University of Montreal, Montreal, Quebec, Canada. renaud.duval@gmail.com, p.keane@ucl.ac.uk

PMID: 37923374
DOI: 10.1136/bjo-2023-324438

Abstract

Background: Evidence on the performance of Generative Pre-trained Transformer 4 (GPT-4), a large language model (LLM), in the ophthalmology question-answering domain is needed.

Methods: We tested GPT-4 on two 260-question multiple choice question sets from the Basic and Clinical Science Course (BCSC) Self-Assessment Program and the OphthoQuestions question banks. We compared the accuracy of GPT-4 models with varying temperatures (creativity setting) and evaluated their responses in a subset of questions. We also compared the best-performing GPT-4 model to GPT-3.5 and to historical human performance.
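
For readers unfamiliar with the temperature setting, the sketch below shows the standard way temperature rescales a next-token distribution and how that changes its entropy. The abstract does not describe GPT-4's internals or the study's exact API settings, so the logits here are hypothetical values chosen purely for illustration, and treating temperature 0 as a greedy (argmax) pick is a common convention rather than a documented detail of the model.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities, sharpened or flattened by temperature."""
    if temperature == 0:
        # Convention assumed here: temperature 0 means always pick the top score.
        best = logits.index(max(logits))
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical scores for four answer options (not taken from the study).
example_logits = [2.0, 1.0, 0.5, 0.1]

for t in (0, 0.3, 0.7, 1.0):
    probs = softmax_with_temperature(example_logits, t)
    print(f"temperature={t}: entropy={entropy(probs):.3f}")
```

Lower temperatures concentrate probability on the top-scoring option (lower entropy, more deterministic output), while higher temperatures spread it across options (higher entropy, more varied output).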

Results: GPT-4-0.3 (GPT-4 with a temperature of 0.3) achieved the highest accuracy among GPT-4 models, with 75.8% on the BCSC set and 70.0% on the OphthoQuestions set. The combined accuracy was 72.9%, which represents an 18.3% raw improvement in accuracy compared with GPT-3.5 (p<0.001). Human graders preferred responses from models with a temperature higher than 0 (more creative). Exam section, question difficulty and cognitive level were all predictive of GPT-4-0.3 answer accuracy. GPT-4-0.3's performance was numerically superior to human performance on the BCSC (75.8% vs 73.3%) and OphthoQuestions (70.0% vs 63.0%), but the difference was not statistically significant (p=0.55 and p=0.09).
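
The reported percentages can be cross-checked with simple arithmetic. The sketch below is an illustration only: the abstract gives percentages and the 260-question set size, so the correct-answer counts are back-calculated assumptions, and the GPT-3.5 figure is implied by reading the "18.3% raw improvement" as a difference in percentage points.

```python
QUESTIONS_PER_SET = 260  # stated in the Methods

def accuracy_pct(correct, total):
    """Accuracy as a percentage, rounded to one decimal place."""
    return round(100 * correct / total, 1)

# Assumed counts, back-calculated from the reported 75.8% and 70.0%.
bcsc_correct = round(0.758 * QUESTIONS_PER_SET)
ophtho_correct = round(0.700 * QUESTIONS_PER_SET)

# Pooled accuracy over both sets (520 questions in total).
combined = accuracy_pct(bcsc_correct + ophtho_correct, 2 * QUESTIONS_PER_SET)

# Implied GPT-3.5 accuracy if the improvement is in percentage points.
gpt35_implied = round(combined - 18.3, 1)

print(f"BCSC: {bcsc_correct}/{QUESTIONS_PER_SET} = "
      f"{accuracy_pct(bcsc_correct, QUESTIONS_PER_SET)}%")
print(f"OphthoQuestions: {ophtho_correct}/{QUESTIONS_PER_SET} = "
      f"{accuracy_pct(ophtho_correct, QUESTIONS_PER_SET)}%")
print(f"Combined: {combined}%")
print(f"Implied GPT-3.5 combined accuracy: {gpt35_implied}%")
```

Under these assumptions the pooled figure matches the reported 72.9%, which is also the simple mean of the two set accuracies because both sets contain the same number of questions.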

Conclusion: GPT-4, an LLM trained on non-ophthalmology-specific data, performs significantly better than its predecessor on simulated ophthalmology board-style exams. Remarkably, its performance tended to be superior to historical human performance, but that difference was not statistically significant in our study.

Keywords: Medical Education.

Grants and funding

MR/T019050/1/MRC_/Medical Research Council/United Kingdom