Evaluating ChatGPT responses in the context of a 53-year-old male with a femoral neck fracture: a qualitative analysis

Eur J Orthop Surg Traumatol. 2024 Feb;34(2):927-955. doi: 10.1007/s00590-023-03742-4. Epub 2023 Sep 30.

Abstract

Purpose: The integration of artificial intelligence (AI) tools, such as ChatGPT, in clinical medicine and medical education has gained significant attention due to their potential to support decision-making and improve patient care. However, there is a need to evaluate the benefits and limitations of these tools in specific clinical scenarios.

Methods: This study used a case study approach within the field of orthopaedic surgery. A clinical case report featuring a 53-year-old male with a femoral neck fracture was used as the basis for evaluation. ChatGPT, a large language model, was asked to respond to clinical questions related to the case. The responses generated by ChatGPT were evaluated qualitatively, considering their relevance, justification, and alignment with the responses of real clinicians. Alternative dialogue protocols were also employed to assess the impact of additional prompts and contextual information on ChatGPT responses.

Results: ChatGPT generally provided clinically appropriate responses to the questions posed in the clinical case report. However, the level of justification and explanation varied across responses. Occasionally, clinically inappropriate responses and inconsistencies were observed across different dialogue protocols and on separate days.

Conclusions: The findings of this study highlight both the potential and the limitations of using ChatGPT in clinical practice. While ChatGPT demonstrated the ability to provide relevant clinical information, the inconsistent justification and occasional clinically inappropriate responses raise concerns about its reliability. These results underscore the importance of careful consideration and validation when using AI tools in healthcare. Further research and clinician training are necessary to integrate AI tools like ChatGPT effectively and to ensure their safe and reliable use in clinical decision-making.

Keywords: Artificial intelligence; ChatGPT; Clinical; Decision-making; Large language model; Orthopaedic surgery.

Publication types

  • Case Reports

MeSH terms

  • Artificial Intelligence
  • Clinical Decision-Making
  • Femoral Neck Fractures* / surgery
  • Humans
  • Male
  • Middle Aged
  • Orthopedic Procedures*
  • Reproducibility of Results