Identifying and Extracting Rare Diseases and Their Phenotypes with Large Language Models

Cathy Shyr; Yan Hu; Lisa Bastarache; Alex Cheng; Rizwan Hamid; Paul Harris; Hua Xu

doi:10.1007/s41666-023-00155-0

J Healthc Inform Res. 2024 Jan 5;8(2):438-461. doi: 10.1007/s41666-023-00155-0. eCollection 2024 Jun.

Authors

Cathy Shyr¹, Yan Hu², Lisa Bastarache¹, Alex Cheng¹, Rizwan Hamid³, Paul Harris¹ ⁴ ⁵, Hua Xu⁶

Affiliations

¹ Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203 USA.
² School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77225 USA.
³ Division of Medical Genetics and Genomic Medicine, Vanderbilt University Medical Center, Nashville, TN 37203 USA.
⁴ Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN 37203 USA.
⁵ Department of Biomedical Engineering, Vanderbilt University Medical Center, 2525 West End Avenue, Nashville, TN 37203 USA.
⁶ Section of Biomedical Informatics and Data Science, Yale School of Medicine, 100 College Street, New Haven, CT 06510 USA.

Abstract

Purpose: Phenotyping is critical for informing rare disease diagnosis and treatment, but disease phenotypes are often embedded in unstructured text. While natural language processing (NLP) can automate extraction, a major bottleneck is developing annotated corpora. Recently, prompt learning with large language models (LLMs) has been shown to lead to generalizable results with no annotated samples (zero-shot) or only a few (few-shot), but no studies have explored this for rare diseases. Our work is the first to study prompt learning for identifying and extracting rare disease phenotypes in the zero- and few-shot settings.

Methods: We compared the performance of prompt learning with ChatGPT and fine-tuning with BioClinicalBERT. We engineered novel prompts for ChatGPT to identify and extract rare diseases and their phenotypes (e.g., diseases, symptoms, and signs), established a benchmark for evaluating its performance, and conducted an in-depth error analysis.
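To illustrate the difference between the zero- and few-shot settings, the following is a minimal sketch of how an extraction prompt might be assembled. The function name `build_prompt`, the instruction wording, and the answer format are assumptions made for illustration; they are not the prompts engineered in the study.

```python
from typing import List, Optional, Tuple


def build_prompt(
    text: str,
    entity_type: str,
    examples: Optional[List[Tuple[str, List[str]]]] = None,
) -> str:
    """Assemble an extraction prompt (hypothetical wording).

    With no examples the prompt is zero-shot; with k annotated
    (text, entities) pairs it becomes k-shot.
    """
    lines = [
        f"Extract every {entity_type} mentioned in the text below.",
        "Answer with a semicolon-separated list, or 'none' if there are no mentions.",
    ]
    # Each annotated example is shown in the same Text/Answer layout
    # that the model is asked to complete for the target text.
    for example_text, example_entities in examples or []:
        answer = "; ".join(example_entities) if example_entities else "none"
        lines.append("")
        lines.append(f"Text: {example_text}")
        lines.append(f"Answer: {answer}")
    lines.append("")
    lines.append(f"Text: {text}")
    lines.append("Answer:")
    return "\n".join(lines)


# Zero-shot: instruction plus target text only.
zero_shot = build_prompt("The patient presented with ataxia.", "sign")

# One-shot: a single annotated sample precedes the target text.
one_shot = build_prompt(
    "The patient presented with ataxia.",
    "sign",
    examples=[("Examination revealed hepatomegaly.", ["hepatomegaly"])],
)
```

The resulting string would then be sent to the LLM; the API call itself is omitted here because it depends on the model and client library used.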

Results: Overall, fine-tuning BioClinicalBERT resulted in higher performance (F1 of 0.689) than ChatGPT (F1 of 0.472 and 0.610 in the zero- and few-shot settings, respectively). However, ChatGPT achieved higher accuracy for rare diseases and signs in the one-shot setting (F1 of 0.778 and 0.725). Conversational, sentence-based prompts generally achieved higher accuracy than structured lists.
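For readers unfamiliar with the metric, the F1 scores above are the harmonic mean of precision and recall. The sketch below computes them for one document under an assumed exact-match criterion (case-insensitive string equality between predicted and gold entities); the study's own matching rules may differ, and the function name `precision_recall_f1` is hypothetical.

```python
from typing import Iterable, Tuple


def precision_recall_f1(
    gold: Iterable[str], predicted: Iterable[str]
) -> Tuple[float, float, float]:
    """Entity-level precision, recall, and F1 under exact matching.

    Assumption: two entities match when their lowercased,
    whitespace-stripped strings are identical.
    """
    gold_set = {g.strip().lower() for g in gold}
    pred_set = {p.strip().lower() for p in predicted}
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Two of four predictions are correct, and two of three gold entities are found.
p, r, f1 = precision_recall_f1(
    gold=["ataxia", "hepatomegaly", "seizure"],
    predicted=["Ataxia", "hepatomegaly", "fever", "rash"],
)
```

In this example precision is 2/4, recall is 2/3, and F1 is 4/7 (about 0.571).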

Conclusion: Prompt learning using ChatGPT has the potential to match or outperform fine-tuning BioClinicalBERT at extracting rare diseases and signs with just one annotated sample. Given its accessibility, ChatGPT could be leveraged to extract these entities without relying on a large, annotated corpus. While LLMs can support rare disease phenotyping, researchers should critically evaluate model outputs to ensure phenotyping accuracy.

Keywords: Artificial intelligence; ChatGPT; Large language model; Natural language processing; Prompt learning; Rare disease.