Comparison of Prompt Engineering and Fine-Tuning Strategies in Large Language Models in the Classification of Clinical Notes

Xiaodan Zhang; Nabasmita Talukdar; Sandeep Vemulapalli; Sumyeong Ahn; Jiankun Wang; Han Meng; Sardar Mehtab Bin Murtaza; Dmitry Leshchiner; Aakash Ajay Dave; Dimitri F Joseph; Martin Witteveen-Lane; Dave Chesla; Jiayu Zhou; Bin Chen

doi:10.1101/2024.02.07.24302444

medRxiv [Preprint]. 2024 Feb 8:2024.02.07.24302444. doi: 10.1101/2024.02.07.24302444.

Authors

Xiaodan Zhang¹, Nabasmita Talukdar¹, Sandeep Vemulapalli¹,², Sumyeong Ahn³, Jiankun Wang³, Han Meng³, Sardar Mehtab Bin Murtaza³, Dmitry Leshchiner¹, Aakash Ajay Dave¹,⁴, Dimitri F Joseph⁵, Martin Witteveen-Lane², Dave Chesla²,⁶, Jiayu Zhou³, Bin Chen¹,³,⁵

Affiliations

¹ Department of Pediatrics and Human Development, College of Human Medicine, Michigan State University, Grand Rapids, MI, USA.
² Office of Research, Spectrum Health, Grand Rapids, MI, USA.
³ Department of Computer Science and Engineering, College of Engineering, Michigan State University, East Lansing, MI, USA.
⁴ Center for Bioethics and Social Justice, Michigan State University, Grand Rapids, MI, USA.
⁵ Department of Pharmacology and Toxicology, College of Human Medicine, Michigan State University, Grand Rapids, MI, USA.
⁶ Department of Obstetrics, Gynecology and Reproductive Biology, College of Human Medicine, Michigan State University, Grand Rapids, MI, USA.

Abstract

Emerging large language models (LLMs) are being actively evaluated in various fields, including healthcare. Most studies have focused on established benchmarks and standard parameters; however, the variation and impact of prompt engineering and fine-tuning strategies have not been fully explored. This study benchmarks GPT-3.5 Turbo, GPT-4, and Llama-7B against BERT models and medical fellows' annotations in identifying patients with metastatic cancer from discharge summaries. Results revealed that clear, concise prompts incorporating reasoning steps significantly enhanced performance. GPT-4 exhibited superior performance among all models. Notably, one-shot learning and fine-tuning provided no incremental benefit. The model's accuracy was sustained even when keywords for metastatic cancer were removed or when half of the input tokens were randomly discarded. These findings underscore GPT-4's potential to substitute for specialized models, such as PubMedBERT, through strategic prompt engineering, and suggest opportunities to improve open-source models, which are better suited for use in clinical settings.
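
The two input perturbations mentioned in the abstract (removing metastatic-cancer keywords and randomly discarding half of the input tokens) can be illustrated with a minimal sketch. This is not the authors' code: the keyword list, the function names `remove_keywords` and `drop_random_tokens`, and the use of whitespace-delimited words as "tokens" are all assumptions made for illustration, and the study may have used a different keyword set and a model-specific tokenizer.

```python
import random
import re

# Hypothetical keyword list; the abstract does not say which terms were removed.
METASTASIS_KEYWORDS = ["metastatic", "metastasis", "metastases", "mets"]


def remove_keywords(text, keywords=METASTASIS_KEYWORDS):
    """Delete whole-word, case-insensitive keyword matches and tidy whitespace."""
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b",
        re.IGNORECASE,
    )
    return re.sub(r"\s+", " ", pattern.sub("", text)).strip()


def drop_random_tokens(text, keep_fraction=0.5, seed=0):
    """Keep a random subset of whitespace-delimited tokens, preserving order.

    Whitespace splitting is an assumption; a real experiment would likely
    use the tokenizer of the model under test.
    """
    tokens = text.split()
    n_keep = round(len(tokens) * keep_fraction)
    rng = random.Random(seed)  # fixed seed so the perturbation is repeatable
    kept_positions = sorted(rng.sample(range(len(tokens)), n_keep))
    return " ".join(tokens[i] for i in kept_positions)


if __name__ == "__main__":
    note = "Patient with metastatic lung cancer; liver metastases noted on CT."
    print(remove_keywords(note))
    print(drop_random_tokens(note))
```

Each perturbed note would then be passed to the classifier with the same prompt as the unmodified note, so that any change in accuracy can be attributed to the perturbation alone.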

Publication types

Preprint
