Are ChatGPT's Free-Text Responses on Periprosthetic Joint Infections of the Hip and Knee Reliable and Useful?

Alexander Draschl; Georg Hauer; Stefan Franz Fischerauer; Angelika Kogler; Lukas Leitner; Dimosthenis Andreou; Andreas Leithner; Patrick Sadoghi

doi:10.3390/jcm12206655

Are ChatGPT's Free-Text Responses on Periprosthetic Joint Infections of the Hip and Knee Reliable and Useful?

J Clin Med. 2023 Oct 20;12(20):6655. doi: 10.3390/jcm12206655.

Authors

Alexander Draschl^{1

2}, Georg Hauer¹, Stefan Franz Fischerauer¹, Angelika Kogler^{1

3}, Lukas Leitner¹, Dimosthenis Andreou¹, Andreas Leithner¹, Patrick Sadoghi¹

Affiliations

¹ Department of Orthopedics and Trauma, Medical University of Graz, Auenbruggerplatz 5, 8036 Graz, Austria.
² Division of Plastic, Aesthetic and Reconstructive Surgery, Department of Surgery, Medical University of Graz, Auenbruggerplatz 29/4, 8036 Graz, Austria.
³ Department of Dermatology and Venereology, Medical University of Graz, Auenbruggerplatz 8, 8036 Graz, Austria.

Abstract

Background: This study aimed to evaluate ChatGPT's performance on questions about periprosthetic joint infections (PJI) of the hip and knee.

Methods: Twenty-seven questions from the 2018 International Consensus Meeting on Musculoskeletal Infection were selected for response generation. The free-text responses were evaluated by three orthopedic surgeons using a five-point Likert scale. Inter-rater reliability (IRR) was assessed via Fleiss' kappa (FK).

Results: Overall, near-perfect IRR was found for disagreement on the presence of factual errors (FK: 0.880, 95% CI [0.724, 1.035], p < 0.001) and agreement on information completeness (FK: 0.848, 95% CI [0.699, 0.996], p < 0.001). Substantial IRR was observed for disagreement on misleading information (FK: 0.743, 95% CI [0.601, 0.886], p < 0.001) and agreement on suitability for patients (FK: 0.627, 95% CI [0.478, 0.776], p < 0.001). Moderate IRR was observed for agreement on "up-to-dateness" (FK: 0.584, 95% CI [0.434, 0.734], p < 0.001) and suitability for orthopedic surgeons (FK: 0.505, 95% CI [0.383, 0.628], p < 0.001). Question- and subtopic-specific analysis revealed diverse IRR levels ranging from near-perfect to poor.

Conclusions: ChatGPT's free-text responses to complex orthopedic questions were predominantly reliable and useful for orthopedic surgeons and patients. Given variations in performance by question and subtopic, consulting additional sources and exercising careful interpretation should be emphasized for reliable medical decision-making.

Keywords: artificial intelligence; hip prosthesis; knee prosthesis; large language model; periprosthetic joint infection.

Grants and funding

This research received no external funding.