A cross-sectional comparative study: ChatGPT 3.5 versus diverse levels of medical experts in the diagnosis of ENT diseases

Mikhael Makhoul; Antoine E Melkane; Patrick El Khoury; Christopher El Hadi; Nayla Matar

doi:10.1007/s00405-024-08509-z

Eur Arch Otorhinolaryngol. 2024 May;281(5):2717-2721. doi: 10.1007/s00405-024-08509-z. Epub 2024 Feb 16.

Authors

Mikhael Makhoul¹, Antoine E Melkane², Patrick El Khoury², Christopher El Hadi², Nayla Matar²

Affiliations

¹ Department of Otolaryngology-Head and Neck Surgery, Hotel Dieu de France Hospital, Saint Joseph University, Alfred Naccache Boulevard, Ashrafieh, PO Box: 166830, Beirut, Lebanon. mikhaelmakhoul651@gmail.com.
² Department of Otolaryngology-Head and Neck Surgery, Hotel Dieu de France Hospital, Saint Joseph University, Alfred Naccache Boulevard, Ashrafieh, PO Box: 166830, Beirut, Lebanon.

PMID: 38365990
DOI: 10.1007/s00405-024-08509-z

Abstract

Purpose: With recent advances in artificial intelligence (AI), it has become crucial to thoroughly evaluate its applicability in healthcare. This study aimed to assess the accuracy of ChatGPT in diagnosing ear, nose, and throat (ENT) pathology and to compare its performance with that of medical experts.

Methods: We conducted a cross-sectional comparative study in which 32 ENT cases were presented to ChatGPT 3.5, ENT physicians, ENT residents, family medicine (FM) specialists, second-year medical students (Med2), and third-year medical students (Med3). Each participant provided three differential diagnoses. The study analyzed diagnostic accuracy rates and inter-rater agreement within and between participant groups and ChatGPT.

Results: The accuracy rate of ChatGPT was 70.8%, which was not significantly different from that of ENT physicians or ENT residents. However, ChatGPT's accuracy rate differed significantly from that of FM specialists (49.8%, p < 0.001) and of medical students (Med2 47.5%, p < 0.001; Med3 47%, p < 0.001). Inter-rater agreement for the differential diagnosis between ChatGPT and each participant group was either poor or fair. In 68.75% of cases, ChatGPT failed to mention the most critical diagnosis.
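
The abstract does not state which agreement statistic was used. As a rough illustration only, the sketch below assumes Cohen's kappa for two raters, which corrects observed agreement for the agreement expected by chance. The function name and the ratings are hypothetical and are not data from the study.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items (assumed statistic)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label for every item.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical ratings for 10 cases (1 = correct diagnosis listed, 0 = not listed).
chatbot = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
clinician = [1, 0, 1, 0, 1, 1, 1, 0, 0, 1]
kappa = cohen_kappa(chatbot, clinician)
print(round(kappa, 3))  # 0.348
```

Under one common convention (Landis and Koch), values of 0.21 to 0.40 are described as fair agreement. Whether the study used this scale is an assumption.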

Conclusions: ChatGPT demonstrated accuracy comparable to that of ENT physicians and ENT residents in diagnosing ENT pathology, outperforming FM specialists, Med2 and Med3. However, it showed limitations in identifying the most critical diagnosis.

Keywords: Artificial intelligence; Case scenarios; ChatGPT; Diagnostic accuracy; ENT; Otolaryngology.

MeSH terms

Artificial Intelligence*
Cross-Sectional Studies
Humans
Neck
Pharyngeal Diseases*
Pharynx