Appropriateness and Readability of ChatGPT-4-Generated Responses for Surgical Treatment of Retinal Diseases

Ophthalmol Retina. 2023 Oct;7(10):862-868. doi: 10.1016/j.oret.2023.05.022. Epub 2023 Jun 3.

Abstract

Objective: To evaluate the appropriateness and readability of the medical information provided by ChatGPT-4, an artificial intelligence-powered conversational chatbot built on a large language model, regarding common vitreoretinal surgeries for retinal detachments (RDs), macular holes (MHs), and epiretinal membranes (ERMs).

Design: Retrospective cross-sectional study.

Subjects: This study did not involve any human participants.

Methods: We created lists of common questions about the definition, prevalence, visual impact, diagnostic methods, surgical and nonsurgical treatment options, postoperative information, surgery-related complications, and visual prognosis of RD, MH, and ERM. Each question was asked three times on the online ChatGPT-4 platform. The data for this cross-sectional study were recorded on April 25, 2023. Two independent retina specialists graded the appropriateness of the responses. Readability was assessed using Readable, an online readability tool.
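The study queried the public ChatGPT-4 web interface manually. For readers who wish to replicate the ask-each-question-three-times protocol programmatically, a minimal sketch using the OpenAI Python SDK follows; the client setup, model identifier, and example questions are illustrative assumptions rather than the authors' actual procedure.

    # Hedged sketch only: the study used the ChatGPT-4 web interface by hand.
    # This shows one way to automate three independent runs per question with
    # the OpenAI Python SDK; model name and questions are assumptions.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    questions = [
        "What is a macular hole?",
        "How is a retinal detachment repaired surgically?",
    ]

    responses = {}
    for q in questions:
        responses[q] = []
        for _ in range(3):  # three independent runs per question, as in the study
            reply = client.chat.completions.create(
                model="gpt-4",
                messages=[{"role": "user", "content": q}],
            )
            responses[q].append(reply.choices[0].message.content)

Collecting the three runs per question side by side makes it straightforward for independent graders to judge whether a response was consistently appropriate or inappropriate at least once.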

Main outcome measures: The "appropriateness" and "readability" of the answers generated by the ChatGPT-4 chatbot.

Results: Responses were consistently appropriate in 84.6% (33/39), 92% (23/25), and 91.7% (22/24) of the questions related to RD, MH, and ERM, respectively. Answers were inappropriate at least once in 5.1% (2/39), 8% (2/25), and 8.3% (2/24) of the respective questions. The average Flesch-Kincaid Grade Level and Flesch Reading Ease Score were 14.1 ± 2.6 and 32.3 ± 10.8 for RD, 14.0 ± 1.3 and 34.4 ± 7.7 for MH, and 14.8 ± 1.3 and 28.1 ± 7.5 for ERM. These scores indicate that the answers are difficult or very difficult for the average layperson to read and that a college-level education would be required to understand the material.
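The readability scores above were produced with the Readable online tool. As a rough illustration of what they measure, the sketch below computes the standard Flesch Reading Ease and Flesch-Kincaid Grade Level formulas in Python; the syllable counter is a simple heuristic assumption, not the algorithm used by dedicated readability software.

    # Hedged sketch of the two standard readability formulas reported above.
    import re

    def count_syllables(word: str) -> int:
        # Rough vowel-group heuristic; real tools use dictionaries and
        # exception lists, so treat these counts as approximate.
        word = word.lower()
        n = len(re.findall(r"[aeiouy]+", word))
        if word.endswith("e") and n > 1:
            n -= 1
        return max(n, 1)

    def readability(text: str) -> tuple[float, float]:
        # Returns (Flesch Reading Ease, Flesch-Kincaid Grade Level).
        sentences = max(len(re.findall(r"[.!?]+", text)), 1)
        words = re.findall(r"[A-Za-z']+", text)
        syllables = sum(count_syllables(w) for w in words)
        words_per_sentence = len(words) / sentences
        syllables_per_word = syllables / max(len(words), 1)
        fre = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
        fkgl = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
        return fre, fkgl

    fre, fkgl = readability(
        "Pars plana vitrectomy removes the vitreous gel so the surgeon can repair the retina."
    )
    print(f"Flesch Reading Ease = {fre:.1f}, Flesch-Kincaid Grade Level = {fkgl:.1f}")

Lower Flesch Reading Ease scores and higher grade levels both indicate harder text; the scores reported here (roughly grade 14, reading ease around 30) correspond to college-level material.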

Conclusions: Most of the answers provided by ChatGPT-4 were consistently appropriate. However, ChatGPT and other natural language models, in their current form, should not be relied on as sources of factual information. Improving the credibility and readability of responses, especially in specialized fields such as medicine, is a critical focus of research. Patients, physicians, and laypersons should be advised of the limitations of these tools for eye- and health-related counseling.

Financial disclosure(s): Proprietary or commercial disclosure may be found in the Footnotes and Disclosures at the end of this article.

Keywords: Artificial intelligence; ChatGPT; Readability; Retina; Vitrectomy.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Artificial Intelligence
  • Comprehension
  • Cross-Sectional Studies
  • Health Literacy*
  • Humans
  • Retinal Diseases* / surgery
  • Retrospective Studies