Performance of a Large Language Model in the Generation of Clinical Guidelines for Antibiotic Prophylaxis in Spine Surgery

Neurospine. 2024 Mar;21(1):128-146. doi: 10.14245/ns.2347310.655. Epub 2024 Mar 31.

Abstract

Objective: Large language models, such as chat generative pre-trained transformer (ChatGPT), have great potential for streamlining medical processes and assisting physicians in clinical decision-making. This study aimed to assess the potential of 2 ChatGPT models (GPT-3.5 and GPT-4.0) to support clinical decision-making by comparing their responses regarding antibiotic prophylaxis in spine surgery against accepted clinical guidelines.

Methods: The ChatGPT models were prompted with questions from the North American Spine Society (NASS) Evidence-based Clinical Guidelines for Multidisciplinary Spine Care for Antibiotic Prophylaxis in Spine Surgery (2013). Their responses were then compared against the guideline recommendations and assessed for accuracy.

Results: Of the 16 NASS guideline questions concerning antibiotic prophylaxis, the GPT-3.5 model answered 10 (62.5%) accurately and the GPT-4.0 model answered 13 (81.3%) accurately. Twenty-five percent of GPT-3.5 answers were deemed overly confident, while 62.5% of GPT-4.0 answers directly cited the NASS guideline as evidence.

Conclusion: ChatGPT demonstrated an impressive ability to answer clinical questions accurately. The GPT-3.5 model's performance was limited by its tendency to give overly confident responses and its inability to identify the most significant elements in its responses. The GPT-4.0 model's responses were more accurate and frequently cited the NASS guideline as direct evidence. Although GPT-4.0 remains far from perfect, it showed a markedly better ability than GPT-3.5 to extract the most relevant available research. Thus, while ChatGPT has shown far-reaching potential, scrutiny should still be exercised regarding its clinical use at this time.

Keywords: Antibiotic prophylaxis; Artificial intelligence; Orthopedic surgery.