Unveiling the ChatGPT phenomenon: Evaluating the consistency and accuracy of endodontic question answers

Int Endod J. 2024 Jan;57(1):108-113. doi: 10.1111/iej.13985. Epub 2023 Oct 9.

Abstract

Aim: Chat Generative Pre-trained Transformer (ChatGPT) is a generative artificial intelligence (AI) application based on large language models (LLMs), designed to simulate human conversation and generate novel content from the data on which it was trained. The aim of this study was to evaluate the consistency and accuracy of ChatGPT-generated answers to clinical questions in endodontics, compared with answers provided by human experts.

Methodology: Ninety-one dichotomous (yes/no) questions were designed and categorized into three levels of difficulty. Twenty questions were randomly selected from each difficulty level. Sixty answers were generated by ChatGPT for each question. Two endodontic experts independently answered the 60 questions. Statistical analysis was performed in SPSS to calculate the consistency and accuracy of the ChatGPT-generated answers relative to the expert answers. Variability was estimated using 95% confidence intervals and standard deviations.
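The two metrics described above can be illustrated with a minimal sketch: consistency as the share of repeated answers that match the most frequent (modal) answer for a question, and accuracy as the share that agree with the expert reference answer. This is an assumed formalization for illustration (the authors' actual SPSS analysis may define the metrics differently), and the example data below are hypothetical.

```python
from collections import Counter

def consistency(answers):
    """Share of repeated yes/no answers matching the modal answer (assumed definition)."""
    counts = Counter(answers)
    return counts.most_common(1)[0][1] / len(answers)

def accuracy(answers, reference):
    """Share of repeated answers agreeing with the expert reference answer."""
    return sum(a == reference for a in answers) / len(answers)

# Hypothetical repeated answers to one question (10 runs); expert reference is "yes"
runs = ["yes"] * 8 + ["no"] * 2
print(consistency(runs))        # 0.8
print(accuracy(runs, "yes"))    # 0.8
```

Under this formalization, a question answered identically across all repetitions scores 100% consistency regardless of whether the answer is correct, which is why consistency and accuracy can diverge, as they do in the results below.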

Results: The answers generated by ChatGPT showed high consistency (85.44%), with no significant differences in consistency across question difficulty levels. In terms of answer accuracy, ChatGPT achieved an average accuracy of 57.33%. However, significant differences in accuracy were observed based on question difficulty, with lower accuracy for easier questions.

Conclusions: Currently, ChatGPT is not capable of replacing dentists in clinical decision-making. As ChatGPT's performance improves through deep learning, it is expected to become more useful and effective in the field of endodontics. However, careful attention and ongoing evaluation are needed to ensure its accuracy, reliability and safety in endodontics.

Keywords: ChatGPT; artificial intelligence; chatbot; dentistry; endodontics; large language models.

MeSH terms

  • Artificial Intelligence*
  • Clinical Decision-Making
  • Dental Care
  • Humans
  • Reproducibility of Results
  • Software*