ChatGPT sits the DFPH exam: large language model performance and potential to support public health learning

BMC Med Educ. 2024 Jan 11;24(1):57. doi: 10.1186/s12909-024-05042-9.

Abstract

Background: Artificial intelligence-based large language models, like ChatGPT, have been rapidly assessed for both risks and potential in health-related assessment and learning. However, their applications in public health professional exams have not yet been studied. We evaluated the performance of ChatGPT on part of the Faculty of Public Health's Diplomate exam (DFPH).

Methods: ChatGPT was provided with a bank of 119 publicly available DFPH question parts from past papers. Its performance was assessed by two active DFPH examiners. The degree of insight and level of understanding apparently displayed by ChatGPT were also assessed.

Results: ChatGPT passed 3 of 4 papers, surpassing the current pass rate. It performed best on questions relating to research methods. Its answers had a high floor: even its weakest responses scored relatively well. Examiners identified ChatGPT answers with 73.6% accuracy and human answers with 28.6% accuracy. ChatGPT provided a mean of 3.6 unique insights per question and appeared to demonstrate a required level of learning on 71.4% of occasions.

Conclusions: Large language models have rapidly increasing potential as a learning tool in public health education. However, their factual fallibility and the difficulty of distinguishing their responses from those of humans pose potential threats to teaching and learning.

Keywords: Artificial intelligence; Examination; Public health; Theory.

MeSH terms

  • Artificial Intelligence*
  • Health Education
  • Humans
  • Language
  • Learning
  • Public Health*