Large language models encode clinical knowledge

Karan Singhal; Shekoofeh Azizi; Tao Tu; S Sara Mahdavi; Jason Wei; Hyung Won Chung; Nathan Scales; Ajay Tanwani; Heather Cole-Lewis; Stephen Pfohl; Perry Payne; Martin Seneviratne; Paul Gamble; Chris Kelly; Abubakr Babiker; Nathanael Schärli; Aakanksha Chowdhery; Philip Mansfield; Dina Demner-Fushman; Blaise Agüera Y Arcas; Dale Webster; Greg S Corrado; Yossi Matias; Katherine Chou; Juraj Gottweis; Nenad Tomasev; Yun Liu; Alvin Rajkomar; Joelle Barral; Christopher Semturs; Alan Karthikesalingam; Vivek Natarajan

doi:10.1038/s41586-023-06291-2

Nature. 2023 Aug;620(7972):172-180. doi: 10.1038/s41586-023-06291-2. Epub 2023 Jul 12.

Authors

Karan Singhal^#¹, Shekoofeh Azizi^#², Tao Tu^#³, S Sara Mahdavi³, Jason Wei³, Hyung Won Chung³, Nathan Scales³, Ajay Tanwani³, Heather Cole-Lewis³, Stephen Pfohl³, Perry Payne³, Martin Seneviratne³, Paul Gamble³, Chris Kelly³, Abubakr Babiker³, Nathanael Schärli³, Aakanksha Chowdhery³, Philip Mansfield³, Dina Demner-Fushman⁴, Blaise Agüera Y Arcas³, Dale Webster³, Greg S Corrado³, Yossi Matias³, Katherine Chou³, Juraj Gottweis³, Nenad Tomasev⁵, Yun Liu³, Alvin Rajkomar³, Joelle Barral³, Christopher Semturs³, Alan Karthikesalingam⁶, Vivek Natarajan⁷

Affiliations

¹ Google Research, Mountain View, CA, USA. karansinghal@google.com.
² Google Research, Mountain View, CA, USA. shekazizi@google.com.
³ Google Research, Mountain View, CA, USA.
⁴ National Library of Medicine, Bethesda, MD, USA.
⁵ DeepMind, London, UK.
⁶ Google Research, Mountain View, CA, USA. alankarthi@google.com.
⁷ Google Research, Mountain View, CA, USA. natviv@google.com.

^# Contributed equally.

Abstract

Large language models (LLMs) have demonstrated impressive capabilities, but the bar for clinical applications is high. Attempts to assess the clinical knowledge of models typically rely on automated evaluations based on limited benchmarks. Here, to address these limitations, we present MultiMedQA, a benchmark combining six existing medical question-answering datasets spanning professional medicine, research and consumer queries, together with a new dataset of medical questions searched online, HealthSearchQA. We propose a human evaluation framework for model answers along multiple axes including factuality, comprehension, reasoning, possible harm and bias. In addition, we evaluate Pathways Language Model¹ (PaLM, a 540-billion parameter LLM) and its instruction-tuned variant, Flan-PaLM², on MultiMedQA. Using a combination of prompting strategies, Flan-PaLM achieves state-of-the-art accuracy on every MultiMedQA multiple-choice dataset (MedQA³, MedMCQA⁴, PubMedQA⁵ and Measuring Massive Multitask Language Understanding (MMLU) clinical topics⁶), including 67.6% accuracy on MedQA (US Medical Licensing Exam-style questions), surpassing the prior state of the art by more than 17 percentage points. However, human evaluation reveals key gaps in Flan-PaLM's answers. To address them, we introduce instruction prompt tuning, a parameter-efficient approach for aligning LLMs to new domains using a few exemplars. The resulting model, Med-PaLM, performs encouragingly, but remains inferior to clinicians. We show that comprehension, knowledge recall and reasoning improve with model scale and instruction prompt tuning, suggesting the potential utility of LLMs in medicine. Our human evaluations reveal limitations of today's models, reinforcing the importance of both evaluation frameworks and method development in creating safe, helpful LLMs for clinical applications.

Publication types

Comparative Study

MeSH terms

Benchmarking*
Bias
Clinical Competence
Comprehension
Computer Simulation*
Datasets as Topic
Knowledge*
Licensure
Medicine* / methods
Medicine* / standards
Natural Language Processing*
Patient Safety
Physicians