The Quality of AI-Generated Dental Caries Multiple Choice Questions: A Comparative Analysis of ChatGPT and Google Bard Language Models

Walaa Magdy Ahmed; Amr Ahmed Azhari; Amal Alfaraj; Abdulaziz Alhamadani; Min Zhang; Chang-Tien Lu

doi:10.1016/j.heliyon.2024.e28198

Heliyon. 2024 Mar 19;10(7):e28198. doi: 10.1016/j.heliyon.2024.e28198. eCollection 2024 Apr 15.

Authors

Walaa Magdy Ahmed¹, Amr Ahmed Azhari¹, Amal Alfaraj², Abdulaziz Alhamadani³, Min Zhang³, Chang-Tien Lu³

Affiliations

¹ Department of Restorative Dentistry, Faculty of Dentistry, King Abdulaziz University, Jeddah, Saudi Arabia.
² Department of Prosthodontics, School of Dentistry, King Faisal University, Al Ahsa, Saudi Arabia.
³ Department of Computer Science, Virginia Tech, Northern Virginia Center, USA.

Abstract

Statement of problem: AI technology presents a variety of benefits and challenges for educators.

Purpose: To investigate whether ChatGPT and Google Bard (now named Gemini) are valuable resources for educators generating multiple-choice questions on dental caries.

Material and methods: A book on dental caries was used. Sixteen paragraphs were extracted by an expert consultant based on applicability and potential for developing multiple-choice questions. ChatGPT and Bard language models were used to produce multiple-choice questions based on this input, and 64 questions were generated. Three dental specialists assessed the relevance, accuracy, and complexity of the generated questions. The questions were qualitatively evaluated using cognitive learning objectives and item writing flaws. Paired sample t-tests and two-way analysis of variance (ANOVA) were used to compare the generated multiple-choice questions and answers between ChatGPT and Bard.
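As an illustration of the paired comparison named above, the following is a minimal sketch of a paired-sample t statistic in Python. It is not the authors' analysis code: the function name `paired_t_statistic` and the score lists are hypothetical placeholders, not data from the study, and the sketch stops at the t statistic and degrees of freedom rather than computing a p-value.

```python
import math
import statistics


def paired_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for a paired-sample t-test.

    a and b are equal-length sequences of scores for the same items,
    e.g. ratings of questions generated from the same paragraph.
    """
    if len(a) != len(b):
        raise ValueError("paired samples must have equal length")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("at least two pairs are required")
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1


# Hypothetical ratings for illustration only (NOT data from the study).
model_a_scores = [4, 3, 5, 4, 4, 3, 5, 4]
model_b_scores = [4, 4, 4, 5, 3, 3, 5, 4]

t_value, dof = paired_t_statistic(model_a_scores, model_b_scores)
print(f"t = {t_value:.3f}, df = {dof}")
```

The resulting t value would then be compared against a t distribution with the reported degrees of freedom to obtain a p-value; a statistics package is normally used for that step.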

Results: There were no significant differences between the questions generated by ChatGPT and Bard. Moreover, the analysis of variance found no significant differences in question quality. Bard-generated questions tended to have higher cognitive levels than those of ChatGPT. Format errors were predominant in ChatGPT-generated questions. Finally, Bard-generated questions contained more absolute terms than those of ChatGPT.

Conclusions: ChatGPT and Bard could generate questions related to dental caries, mainly at the cognitive level of knowledge and comprehension.

Clinical significance: Language models are crucial for generating subject-specific questions used in quizzes, tests, and education. By using these models, educators can save time and focus on lesson preparation and student engagement instead of solely focusing on assessment creation. Additionally, language models are adept at generating numerous questions, making them particularly valuable for large-scale exams. However, educators must carefully review and adapt the questions to ensure they align with their learning goals.

Keywords: Bard; ChatGPT; Dental academic assessment; Dental caries; Dental educator; Multiple-choice question.