Iterative Prompt Refinement for Mining Gene Relationships from ChatGPT

Yibo Chen; Jeffrey Gao; Marius Petruc; Richard D Hammer; Mihail Popescu; Dong Xu

doi:10.1101/2023.12.23.573201

bioRxiv [Preprint]. 2023 Dec 23:2023.12.23.573201. doi: 10.1101/2023.12.23.573201.

Authors

Yibo Chen¹, Jeffrey Gao², Marius Petruc¹, Richard D Hammer³, Mihail Popescu⁴, Dong Xu⁵

Affiliations

¹ Institute for Data Science and Informatics, University of Missouri, Columbia, Missouri 65211, USA.
² Marriotts Ridge High School, Marriottsville, MD, 21104, USA.
³ Department of Pathology and Anatomical Sciences, University of Missouri, Columbia, Missouri 65211, USA.
⁴ Department of Biomedical Informatics, Biostatistics and Medical Epidemiology, University of Missouri, Columbia, MO 65211, USA.
⁵ Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO, 65211, USA.

Abstract

ChatGPT has demonstrated its potential as a surrogate knowledge graph. Trained on extensive data sources, including open-access publications, peer-reviewed research articles, and biomedical websites, ChatGPT has extracted information on gene relationships and biological pathways. However, a major challenge is model hallucination, i.e., high false positive rates. To assess and address this challenge, we systematically evaluated ChatGPT's capacity for predicting gene relationships using GPT-3.5-turbo and GPT-4. Benchmarking against the KEGG Pathway Database as the ground truth, we experimented with diverse prompting strategies, targeting three types of gene relationship: activation, inhibition, and phosphorylation. We introduced an iterative prompt refinement technique: we assessed prompt efficacy using metrics such as F1 score, precision, and recall, and then re-engaged GPT-4 to suggest improved prompts. A refined prompt, which combines a specialized role with explanatory text, significantly enhanced performance. Going beyond pairwise gene relationships, we also deciphered complex gene interplays, such as gene interaction chains and pathways pertinent to diseases like non-small cell lung cancer. Direct prompts showed limited success, but "least-to-most" prompting exhibited significant potential for such network construction. The methods in this study may be applicable to other bioinformatics prediction problems.

Keywords: Bioinformatics; ChatGPT; Gene Relation; Knowledge Graph; Prompt Refinement.
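
The scoring step behind the prompt comparison described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code. It assumes that each prompt's output has already been parsed into (source gene, relation type, target gene) triples and that the ground truth is available as a set of the same triples. The gene names, prompt labels, and function names are hypothetical placeholders, and the step in which GPT-4 is asked to propose improved prompts is not shown.

```python
from typing import Dict, NamedTuple, Set, Tuple

# A gene relationship as (source gene, relation type, target gene).
Relation = Tuple[str, str, str]


class Scores(NamedTuple):
    precision: float
    recall: float
    f1: float


def score_predictions(predicted: Set[Relation], reference: Set[Relation]) -> Scores:
    """Compare predicted triples with reference triples by exact match."""
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return Scores(precision, recall, f1)


def pick_best_prompt(
    outputs: Dict[str, Set[Relation]], reference: Set[Relation]
) -> Tuple[str, Scores]:
    """Return the prompt label with the highest F1 score, plus its scores."""
    scored = {name: score_predictions(pred, reference) for name, pred in outputs.items()}
    best = max(scored, key=lambda name: scored[name].f1)
    return best, scored[best]


# Toy data with placeholder gene names (not taken from KEGG or the paper).
REFERENCE: Set[Relation] = {
    ("GENE_A", "activation", "GENE_B"),
    ("GENE_B", "inhibition", "GENE_C"),
    ("GENE_C", "phosphorylation", "GENE_D"),
    ("GENE_A", "inhibition", "GENE_D"),
}

# Hypothetical parsed outputs for two prompt variants.
OUTPUTS: Dict[str, Set[Relation]] = {
    "direct": {
        ("GENE_A", "activation", "GENE_B"),
        ("GENE_B", "activation", "GENE_C"),   # wrong relation type
        ("GENE_C", "phosphorylation", "GENE_D"),
        ("GENE_D", "activation", "GENE_A"),   # not in the reference
    },
    "role_plus_explanation": {
        ("GENE_A", "activation", "GENE_B"),
        ("GENE_B", "inhibition", "GENE_C"),
        ("GENE_C", "phosphorylation", "GENE_D"),
    },
}

if __name__ == "__main__":
    for name, predicted in OUTPUTS.items():
        print(name, score_predictions(predicted, REFERENCE))
    print("best:", pick_best_prompt(OUTPUTS, REFERENCE))
```

Exact matching on the full triple means a prediction with the right gene pair but the wrong relation type counts as a false positive, which is one way to make hallucinated relationships visible in the precision figure.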

Publication types

Preprint
