ChatGPT vs. web search for patient questions: what does ChatGPT do better?

Sarek A Shen; Carlos A Perez-Heydrich; Deborah X Xie; Jason C Nellis

doi:10.1007/s00405-024-08524-0

Eur Arch Otorhinolaryngol. 2024 Jun;281(6):3219-3225. doi: 10.1007/s00405-024-08524-0. Epub 2024 Feb 28.

Authors

Sarek A Shen¹, Carlos A Perez-Heydrich², Deborah X Xie³, Jason C Nellis³

Affiliations

¹ Department of Otolaryngology-Head and Neck Surgery, Johns Hopkins School of Medicine, 601 North Caroline Street, Baltimore, MD, 21287, USA. sarek.shen@gmail.com.
² Johns Hopkins School of Medicine, Baltimore, MD, USA.
³ Department of Otolaryngology-Head and Neck Surgery, Johns Hopkins School of Medicine, 601 North Caroline Street, Baltimore, MD, 21287, USA.

PMID: 38416195
DOI: 10.1007/s00405-024-08524-0

Abstract

Purpose: Chat generative pretrained transformer (ChatGPT) has the potential to significantly impact how patients acquire medical information online. Here, we characterize the readability and appropriateness of ChatGPT responses to a range of patient questions compared to results from traditional web searches.

Methods: Patient questions related to the published Clinical Practice Guidelines by the American Academy of Otolaryngology-Head and Neck Surgery were sourced from existing online posts. Questions were categorized using a modified Rothwell classification system into (1) fact, (2) policy, and (3) diagnosis and recommendations. These were queried using ChatGPT and traditional web search. All results were evaluated on readability (Flesch Reading Ease and Flesch-Kincaid Grade Level) and understandability (Patient Education Materials Assessment Tool). Accuracy was assessed by two blinded clinical evaluators using a three-point ordinal scale.
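For readers unfamiliar with the two readability metrics named above, the following is a minimal sketch of the standard Flesch Reading Ease (FRE) and Flesch-Kincaid Grade Level (FKGL) formulas. It is not the authors' scoring pipeline, which the abstract does not describe. The function names and the vowel-group syllable counter are assumptions made for illustration; dedicated readability tools use more careful syllable and sentence detection, so scores will differ somewhat.

```python
import re


def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Standard FRE formula; higher scores indicate easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Standard FKGL formula; the result approximates a US school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption): one syllable per run of vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def text_stats(text: str) -> tuple[int, int, int]:
    """Return (words, sentences, syllables) using simple regex splitting."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return len(words), len(sentences), syllables


if __name__ == "__main__":
    # Hypothetical sample text, not taken from the study.
    sample = "Ear infections are common in children. Most clear up without antibiotics."
    w, s, syl = text_stats(sample)
    print(f"FRE:  {flesch_reading_ease(w, s, syl):.1f}")
    print(f"FKGL: {flesch_kincaid_grade(w, s, syl):.1f}")
```

Both formulas depend only on average sentence length and average syllables per word, so shorter sentences and shorter words raise FRE and lower FKGL.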

Results: 54 questions were organized into fact (37.0%), policy (37.0%), and diagnosis (25.8%). The average readability for ChatGPT responses was lower than that of traditional web search results (FRE: 42.3 ± 13.1 vs. 55.6 ± 10.5, p < 0.001), while the PEMAT understandability was equivalent (93.8% vs. 93.5%, p = 0.17). ChatGPT scored higher than web search for questions in the 'Diagnosis' category (p < 0.01); there was no difference in questions categorized as 'Fact' (p = 0.15) or 'Policy' (p = 0.22). Additional prompting improved ChatGPT response readability (FRE 55.6 ± 13.6, p < 0.01).

Conclusions: ChatGPT outperforms web search in answering patient questions related to symptom-based diagnoses and is equivalent in providing medical facts and established policy. Appropriate prompting can further improve readability while maintaining accuracy. Further patient education is needed to relay the benefits and limitations of this technology as a source of medical information.

Keywords: Accessibility; Accuracy; ChatGPT; Large language model; Patient education; Patient questions; Readability.

Publication types

Comparative Study

MeSH terms

Comprehension*
Health Literacy
Humans
Internet*
Patient Education as Topic / methods
