Benchmarking the symptom-checking capabilities of ChatGPT for a broad range of diseases

Anjun Chen; Drake O Chen; Lu Tian

doi:10.1093/jamia/ocad245

J Am Med Inform Assoc. 2023 Dec 18:ocad245. doi: 10.1093/jamia/ocad245. Online ahead of print.

Authors

Anjun Chen¹,², Drake O Chen², Lu Tian³

Affiliations

¹ Health Sciences, ELHS Institute, Palo Alto, CA 94306, United States.
² LHS Tech Forum Initiative, Learning Health Community, Palo Alto, CA 94306, United States.
³ Department of Biomedical Data Science, Stanford University, Stanford, CA 94305, United States.

PMID: 38109889
DOI: 10.1093/jamia/ocad245

Abstract

Objective: This study evaluates ChatGPT's symptom-checking accuracy across a broad range of diseases using the Mayo Clinic Symptom Checker patient service as a benchmark.

Methods: We prompted ChatGPT with symptoms of 194 distinct diseases. By comparing its predictions with expectations, we calculated a relative comparative score (RCS) to gauge accuracy.

Results: ChatGPT's GPT-4 model achieved an average RCS of 78.8%, outperforming GPT-3.5-turbo by 10.5%. Some specialties scored above 90%.

Discussion: The test set, although extensive, was not exhaustive. Future studies should include a more comprehensive disease spectrum.

Conclusion: ChatGPT exhibits high accuracy in symptom checking for a broad range of diseases, showcasing its potential as a medical training tool in learning health systems to enhance care quality and address health disparities.

Keywords: ChatGPT; benchmarking; learning health system; medical training; symptom checking.