Evaluating large language models as agents in the clinic

NPJ Digit Med. 2024 Apr 3;7(1):84. doi: 10.1038/s41746-024-01083-y.

Abstract

Recent developments in large language models (LLMs) have unlocked opportunities for healthcare, from information synthesis to clinical decision support. These LLMs are not just capable of modeling language, but can also act as intelligent “agents” that interact with stakeholders in open-ended conversations and even influence clinical decision-making. Rather than being judged by benchmarks that measure a model’s ability to process clinical data or answer standardized test questions, LLM agents should be evaluated in high-fidelity simulations of clinical settings and assessed for their impact on clinical workflows. These evaluation frameworks, which we refer to as “Artificial Intelligence Structured Clinical Examinations” (“AI-SCE”), can draw from comparable technologies, such as self-driving cars, in which machines operate with varying degrees of self-governance in dynamic environments with multiple stakeholders. Developing these robust, real-world clinical evaluations will be crucial to deploying LLM agents in medical settings.