An evaluation of GPT models for phenotype concept recognition

Tudor Groza; Harry Caufield; Dylan Gration; Gareth Baynam; Melissa A Haendel; Peter N Robinson; Christopher J Mungall; Justin T Reese

doi:10.1186/s12911-024-02439-w

BMC Med Inform Decis Mak. 2024 Jan 31;24(1):30. doi: 10.1186/s12911-024-02439-w.

Authors

Tudor Groza¹ ² ³ ⁴, Harry Caufield⁵, Dylan Gration⁶, Gareth Baynam⁷ ⁸ ⁶ ⁹, Melissa A Haendel¹⁰, Peter N Robinson¹¹ ¹², Christopher J Mungall⁵, Justin T Reese⁵

Affiliations

¹ Rare Care Centre, Perth Children's Hospital, 15 Hospital Avenue, Nedlands, WA, 6009, Australia. tudor.groza@health.wa.gov.au.
² Telethon Kids Institute, 15 Hospital Avenue, Nedlands, WA, 6009, Australia. tudor.groza@health.wa.gov.au.
³ School of Electrical Engineering, Computing and Mathematical Sciences, Curtin University, Kent St, Bentley, WA, 6102, Australia. tudor.groza@health.wa.gov.au.
⁴ SingHealth Duke-NUS Institute of Precision Medicine, 5 Hospital Drive Level 9, Singapore, 169609, Singapore. tudor.groza@health.wa.gov.au.
⁵ Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA.
⁶ Western Australian Register of Developmental Anomalies, King Edward Memorial Hospital, 374 Bagot Road, Subiaco, WA, 6008, Australia.
⁷ Rare Care Centre, Perth Children's Hospital, 15 Hospital Avenue, Nedlands, WA, 6009, Australia.
⁸ Telethon Kids Institute, 15 Hospital Avenue, Nedlands, WA, 6009, Australia.
⁹ Faculty of Health and Medical Sciences, University of Western Australia, 35 Stirling Hwy, Crawley, WA, 6009, Australia.
¹⁰ University of Colorado Anschutz Medical Campus, Aurora, CO, 80045, USA.
¹¹ The Jackson Laboratory for Genomic Medicine, Farmington, CT, 06032, USA.
¹² Institute for Systems Genomics, University of Connecticut, Farmington, CT, 06032, USA.

Abstract

Objective: Clinical deep phenotyping and phenotype annotation play a critical role both in diagnosing patients with rare disorders and in building computationally tractable knowledge in the rare disorders field. These processes rely on ontology concepts, often from the Human Phenotype Ontology, in conjunction with a phenotype concept recognition task (usually supported by machine learning methods) to curate patient profiles or existing scientific literature. Given the significant shift toward large language models (LLMs) for most NLP tasks, we examine the performance of the latest Generative Pre-trained Transformer (GPT) models underpinning ChatGPT as a foundation for the tasks of clinical phenotyping and phenotype annotation.

Materials and methods: The experimental setup of the study included seven prompts of varying specificity, two GPT models (gpt-3.5-turbo and gpt-4.0) and two established gold standard corpora for phenotype recognition, one consisting of publication abstracts and the other of clinical observations.

Results: The best run, using in-context learning, achieved a document-level F1 score of 0.58 on publication abstracts and 0.75 on clinical observations, as well as a mention-level F1 score of 0.7, which surpasses the current best-in-class tool. Without in-context learning, however, performance is significantly below that of existing approaches.
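For readers unfamiliar with the metric, a document-level F1 score can be sketched as below. This is only an illustration under stated assumptions, not the authors' evaluation code: it assumes each document is reduced to a set of ontology identifiers, that counts are micro-averaged across documents (the abstract does not state the averaging scheme), and the function name `document_level_prf` and the example documents are hypothetical.

```python
from typing import Dict, Set, Tuple


def document_level_prf(
    gold: Dict[str, Set[str]],
    predicted: Dict[str, Set[str]],
) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F1 over per-document ID sets.

    Assumption: a prediction counts as correct when the same ontology ID
    appears in the gold set of the same document, regardless of where in
    the text it was mentioned.
    """
    tp = fp = fn = 0
    for doc_id in gold.keys() | predicted.keys():
        g = gold.get(doc_id, set())
        p = predicted.get(doc_id, set())
        tp += len(g & p)   # IDs found in both gold and prediction
        fp += len(p - g)   # predicted IDs absent from gold
        fn += len(g - p)   # gold IDs the system missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1


# Illustrative data only; these annotations are not taken from either corpus.
gold = {
    "doc1": {"HP:0001250", "HP:0001263"},
    "doc2": {"HP:0000252"},
}
predicted = {
    "doc1": {"HP:0001250", "HP:0004322"},
    "doc2": {"HP:0000252"},
}
precision, recall, f1 = document_level_prf(gold, predicted)
```

A mention-level score differs in that each text span is matched individually, so the same concept appearing twice in a document counts twice; the set-based comparison above would need to be replaced by a comparison of (span, ID) pairs.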

Conclusion: Our experiments show that gpt-4.0 surpasses state-of-the-art performance if the task is constrained to a subset of the target ontology where there is prior knowledge of the terms that are expected to be matched. While the results are promising, the non-deterministic nature of the outcomes, the high cost and the lack of concordance between different runs using the same prompt and input make the use of these LLMs challenging for this particular task.
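One way to quantify concordance between repeated runs is the mean pairwise Jaccard similarity of the concept sets each run returns for the same input. The sketch below is an assumption for illustration: the abstract does not say how concordance was measured, and the function names and example runs are hypothetical.

```python
from itertools import combinations
from typing import List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two ID sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_pairwise_concordance(runs: List[Set[str]]) -> float:
    """Average Jaccard similarity over all unordered pairs of runs."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        return 1.0  # a single run is trivially concordant with itself
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Illustrative outputs of three runs with the same prompt and input.
runs = [
    {"HP:0001250", "HP:0001263"},
    {"HP:0001250", "HP:0001263"},
    {"HP:0001250", "HP:0000252"},
]
concordance = mean_pairwise_concordance(runs)
```

A value of 1.0 would mean every run returned exactly the same set of concepts; lower values indicate the kind of run-to-run variation the conclusion describes.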

Keywords: Artificial intelligence; Generative pretrained transformer; Human Phenotype Ontology; Large language models; Phenotype concept recognition.

MeSH terms

Humans
Knowledge*
Language*
Machine Learning
Phenotype
Rare Diseases
