Evaluation of ChatGPT for Patient Counseling in Kidney Stone Clinic: A Prospective Study

Mohamed Javid; Mahendra Bhandari; P Parameshwari; Madhu Reddiboina; Srikala Prasad

doi:10.1089/end.2023.0571

J Endourol. 2024 Apr;38(4):377-383. doi: 10.1089/end.2023.0571. Epub 2024 Feb 27.

Authors

Mohamed Javid¹, Mahendra Bhandari², P Parameshwari³, Madhu Reddiboina⁴, Srikala Prasad¹

Affiliations

¹ Department of Urology, Chengalpattu Medical College, Chengalpattu, Tamil Nadu, India.
² Vattikuti Urology Institute, Henry Ford Hospital, Detroit, Michigan, USA.
³ Department of Community Medicine, Chengalpattu Medical College, Chengalpattu, Tamil Nadu, India.
⁴ RediMinds, Inc., Southfield, Michigan, USA.

PMID: 38411835

Abstract

Introduction: Large language models (LLMs) have the potential to improve the clinical workflow and make patient care more efficient. We prospectively evaluated the performance of the LLM ChatGPT as a patient counseling tool in the urology stone clinic and validated the generated responses against those of urologists.

Methods: We collected 61 questions from 12 kidney stone patients and posed them to ChatGPT and to a panel of experienced urologists (Level 1). The blinded responses of the urologists and ChatGPT were then presented to two expert urologists (Level 2) for comparative evaluation on five preset domains: accuracy, relevance, empathy, completeness, and practicality. All responses were rated on a Likert scale of 1 to 10 for psychometric response evaluation. The mean difference in the scores given by the Level 2 urologists was analyzed, and interrater reliability (IRR), the level of agreement between the two Level 2 urologists, was assessed with Cohen's kappa.

Results: Mean scores differed significantly between the ChatGPT and urologist responses in accuracy (p < 0.001), empathy (p < 0.001), completeness (p < 0.001), and practicality (p < 0.001), with ChatGPT's responses rated higher; the difference in the relevance domain did not reach significance (p = 0.051). The IRR analysis revealed significant agreement only in the empathy domain [κ = 0.163 (0.059-0.266)].

Conclusion: We believe the introduction of ChatGPT into the clinical workflow could further optimize the information provided to patients in a busy stone clinic. In this preliminary study, ChatGPT supplemented the answers provided by the urologists, adding value to the conversation. However, in its current state, it is not yet ready to be a direct source of authentic information for patients. We recommend its use as a source for building a comprehensive Frequently Asked Questions bank as a prelude to developing an LLM chatbot for patient counseling.
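For readers unfamiliar with the interrater agreement statistic named in the Methods, the following is a minimal Python sketch of unweighted Cohen's kappa for two raters. It is an illustration only: the function name `cohens_kappa` and the example scores are hypothetical, the scores are not study data, and the abstract does not state whether the authors used an unweighted or weighted form of kappa.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items given identical scores.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement, computed from each rater's marginal score frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    p_expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when both raters give one identical score throughout")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical empathy scores (1 to 10) from two Level 2 raters; NOT study data.
rater_1 = [8, 7, 9, 6, 8, 7, 9, 8]
rater_2 = [8, 6, 9, 7, 8, 7, 8, 8]
kappa = cohens_kappa(rater_1, rater_2)
print(f"kappa = {kappa:.3f}")
```

Kappa corrects raw agreement for the agreement expected by chance, so a value near 0 indicates agreement little better than chance and a value of 1 indicates perfect agreement. On a 10-point scale, exact matches are rare, which is one reason a weighted kappa is sometimes preferred for ordinal ratings.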

Keywords: ChatGPT; large language models; patient queries; renal stone disease.

MeSH terms

Counseling
Dietary Supplements
Humans
Kidney Calculi*
Prospective Studies
Reproducibility of Results