Quality and Agreement With Scientific Consensus of ChatGPT Information Regarding Corneal Transplantation and Fuchs Dystrophy

Kayson S Barclay; Jane Y You; Michael J Coleman; Priya M Mathews; Vincent L Ray; Kamran M Riaz; Joaquin O De Rojas; Aaron S Wang; Shelly H Watson; Ellen H Koo; Allen O Eghrari

doi:10.1097/ICO.0000000000003439

Cornea. 2023 Nov 28. doi: 10.1097/ICO.0000000000003439. Online ahead of print.

Affiliations

¹ Morgan State University, Baltimore, MD.
² Harvard Medical School, Boston, MA.
³ Cataract and Laser Institute, Mishawaka, IN.
⁴ Center for Sight, Sarasota, FL.
⁵ Fremont Eye Care Physicians, Fremont, CA.
⁶ Dean McGee Eye Institute, Oklahoma City, OK.
⁷ Glaucoma Cataract Consultants, Pittsburgh, PA.
⁸ Northern Virginia Ophthalmology Associates, Falls Church, VA.
⁹ Bascom Palmer Eye Institute, Miami, FL.
¹⁰ Wilmer Eye Institute at Johns Hopkins, Baltimore, MD.

PMID: 38016014

Abstract

Purpose: ChatGPT is a source of information commonly used by patients and clinicians. However, it can be prone to error and requires validation. We sought to assess the quality and accuracy of information regarding corneal transplantation and Fuchs dystrophy from 2 iterations of ChatGPT, and whether its answers improve over time.

Methods: A total of 10 corneal specialists collaborated to assess responses of the algorithm to 10 commonly asked questions related to endothelial keratoplasty and Fuchs dystrophy. These questions were posed to both ChatGPT-3.5 and its newer generation, GPT-4. Assessments evaluated the quality, safety, accuracy, and bias of the information. Chi-squared tests, Fisher exact tests, and regression analyses were conducted.
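As an illustration of the kind of chi-squared comparison named in the Methods, the following is a minimal sketch of a Pearson chi-squared test on a 2x2 table of counts (model version by correct/incorrect rating). The function name `chi_squared_2x2` and the counts are hypothetical and are not the study's data; the abstract does not report per-model counts or whether a continuity correction was applied, so none is used here.

```python
import math


def chi_squared_2x2(table):
    """Pearson chi-squared test (no continuity correction) for a 2x2 table.

    `table` is [[a, b], [c, d]]: rows are groups, columns are outcomes.
    Returns (chi2 statistic, two-sided p-value with 1 degree of freedom).
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of group and outcome.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    # With 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical counts for illustration only (NOT taken from the study):
# row 0 = older model (30 correct, 20 incorrect),
# row 1 = newer model (45 correct, 5 incorrect).
hypothetical_counts = [[30, 20], [45, 5]]
chi2, p_value = chi_squared_2x2(hypothetical_counts)
print(f"chi2 = {chi2:.2f}, p = {p_value:.6f}")
```

When any expected cell count is small (a common rule of thumb is below 5), a Fisher exact test is usually preferred over the chi-squared approximation, which is consistent with the Methods listing both tests.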

Results: We analyzed 180 valid responses. On a 1 (A+) to 5 (F) scale, the average score given by all specialists across questions was 2.5 for ChatGPT-3.5 and 1.4 for GPT-4, a significant improvement (P < 0.0001). Most responses by both ChatGPT-3.5 (61%) and GPT-4 (89%) used correct facts, a proportion that significantly improved across iterations (P < 0.00001). Approximately a third (35%) of responses from ChatGPT-3.5 were considered against the scientific consensus, a notable rate of error that decreased to only 5% of answers from GPT-4 (P < 0.00001).

Conclusions: The quality of responses in ChatGPT significantly improved between versions 3.5 and 4, and the odds of providing information against the scientific consensus decreased. However, the technology is still capable of producing inaccurate statements. Corneal specialists are uniquely positioned to help users discern the veracity and applicability of such information.