Generative Artificial Intelligence Performs at a Second-Year Orthopedic Resident Level

Zachary C Lum; Dylon P Collins; Stanley Dennison; Lohitha Guntupalli; Soham Choudhary; Augustine M Saiz; Robert L Randall

doi:10.7759/cureus.56104

Cureus. 2024 Mar 13;16(3):e56104. doi: 10.7759/cureus.56104. eCollection 2024 Mar.

Authors

Zachary C Lum¹ ², Dylon P Collins³, Stanley Dennison³, Lohitha Guntupalli⁴, Soham Choudhary⁵, Augustine M Saiz⁶, Robert L Randall⁶

Affiliations

¹ Orthopedic Surgery, University of California (UC) Davis School of Medicine, Sacramento, USA.
² Orthopedic Surgery, Nova Southeastern University, Pembroke Pines, USA.
³ College of Medicine, Nova Southeastern University Dr. Kiran C. Patel College of Osteopathic Medicine, Fort Lauderdale, USA.
⁴ Osteopathic Medicine, Nova Southeastern University Dr. Kiran C. Patel College of Osteopathic Medicine, Clearwater, USA.
⁵ Orthopedic Surgery, University of California, Davis, Davis, USA.
⁶ Orthopedic Surgery, University of California (UC) Davis Health, Sacramento, USA.

Abstract

Introduction: Artificial intelligence (AI) models using large language models (LLMs) and non-specific domains have gained attention for their innovative information processing. As AI advances, it is essential to regularly evaluate these tools' competency to maintain high standards, prevent errors or biases, and avoid flawed reasoning or misinformation that could harm patients or spread inaccuracies. Our study aimed to determine the performance of Chat Generative Pre-trained Transformer (ChatGPT) by OpenAI and Google BARD (BARD) in orthopedic surgery, assess performance based on question types, contrast performance between the two AIs, and compare AI performance to that of orthopedic residents.

Methods: We administered 757 Orthopedic In-Training Examination (OITE) questions to ChatGPT and BARD. After excluding image-related questions, the AIs answered 390 multiple-choice questions, all categorized within 10 sub-specialties (basic science, trauma, sports medicine, spine, hip and knee, pediatrics, oncology, shoulder and elbow, hand, and foot and ankle) and three taxonomy classes (recall, interpretation, and application of knowledge). Statistical analysis was performed on the number of questions answered correctly by each AI model, the performance of each AI model within each question sub-specialty, and the performance of each AI model compared with the results of orthopedic residents classified by post-graduate year (PGY) level.

Results: BARD answered more overall questions correctly (58% vs 54%, p<0.001). ChatGPT performed better in sports medicine and basic science and worse in hand surgery, while BARD performed better in basic science (p<0.05). The AIs performed better in recall questions compared to the application of knowledge (p<0.05). Based on previous data, the AI ranked in the 42nd-96th percentile for PGY1s, 27th-58th for PGY2s, 3rd-29th for PGY3s, 1st-21st for PGY4s, and 1st-17th for PGY5s.

Discussion: ChatGPT excelled in sports medicine but fell short in hand surgery, while both AIs performed well in the basic science sub-specialty but performed poorly in application of knowledge taxonomy questions. BARD performed better than ChatGPT overall. Although the AI reached the second-year PGY orthopedic resident level, it fell short of passing the American Board of Orthopedic Surgery (ABOS) examination. Its strengths in recall-based inquiries highlight its potential as an orthopedic learning and educational tool.
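The abstract does not state which statistical test produced the overall comparison between the two models. Because both AIs answered the same 390 questions, a paired test is one plausible choice. The following is a minimal sketch, assuming an exact McNemar test on per-question correctness; the function names (`approx_correct`, `discordant_counts`, `mcnemar_exact`) and the toy answer vectors are hypothetical and are not taken from the study.

```python
from math import comb


def approx_correct(fraction: float, n_questions: int) -> int:
    """Approximate count of correct answers from a rounded percentage.

    The abstract reports only rounded percentages, so this is an estimate.
    """
    return round(fraction * n_questions)


def discordant_counts(correct_a, correct_b):
    """Count questions where exactly one of the two models was correct.

    Returns (b, c): b = A right and B wrong, c = B right and A wrong.
    """
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    return b, c


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant counts.

    Under the null hypothesis, each discordant question is equally likely
    to favor either model, so min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Hypothetical per-question results for two models (True = correct).
    model_a = [True, True, False, False, True, False]
    model_b = [True, False, True, True, True, True]
    b, c = discordant_counts(model_a, model_b)
    print("discordant counts:", b, c)
    print("p-value:", mcnemar_exact(b, c))
    # Estimated counts implied by the reported 58% and 54% of 390 questions.
    print(approx_correct(0.58, 390), approx_correct(0.54, 390))
```

A paired test uses only the questions on which the two models disagree, which is why a modest difference in overall accuracy can still be significant when the disagreements are lopsided. Reproducing the reported p-value would require the per-question results, which the abstract does not provide.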

Keywords: chatgpt; generative artificial intelligence; google bard; oite; orthopaedic surgery.