One LLM is not Enough: Harnessing the Power of Ensemble Learning for Medical Question Answering

Han Yang; Mingchen Li; Huixue Zhou; Yongkang Xiao; Qian Fang; Rui Zhang

doi:10.1101/2023.12.21.23300380

medRxiv [Preprint]. 2023 Dec 24:2023.12.21.23300380. doi: 10.1101/2023.12.21.23300380.

Authors

Han Yang¹, Mingchen Li², Huixue Zhou¹, Yongkang Xiao¹, Qian Fang³, Rui Zhang²

Affiliations

¹ Institute for Health Informatics, University of Minnesota, Minneapolis, Minnesota, USA.
² Division of Computational Health Sciences, Department of Surgery, University of Minnesota, Minneapolis, MN, USA.
³ H. Milton Stewart School of Industrial & Systems Engineering, Georgia Institute of Technology, Atlanta, GA, USA.

Abstract

Objective: To enhance the accuracy and reliability of diverse medical question-answering (QA) tasks and to investigate efficient approaches to deploying large language model (LLM) technologies, we developed a novel ensemble learning pipeline that uses state-of-the-art LLMs, with a focus on improving performance across diverse medical QA datasets.

Materials and methods: Our study employs three medical QA datasets: PubMedQA, MedQA-USMLE, and MedMCQA, each presenting unique challenges in biomedical question-answering. The proposed LLM-Synergy framework, focusing exclusively on zero-shot cases using LLMs, incorporates two primary ensemble methods. The first is a Boosting-based weighted majority vote ensemble, where decision-making is expedited and refined by assigning variable weights to different LLMs through a boosting algorithm. The second method is Cluster-based Dynamic Model Selection, which dynamically selects the most suitable LLM votes for each query, based on the characteristics of question contexts, using a clustering approach.
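The two ensemble methods can be illustrated with a minimal sketch. The abstract does not specify the boosting algorithm, the clustering method, or the question representation, so everything below is an assumption: the weights follow an AdaBoost-style (SAMME) scheme applied to fixed models on a labeled development set, questions are assumed to arrive as precomputed embedding vectors with precomputed cluster centroids, and the selection step picks a single model per cluster rather than a set of votes. All function and variable names are hypothetical.

```python
import math
from collections import defaultdict


def boosting_weights(dev_preds, dev_labels, n_classes):
    """SAMME-style weights for fixed models (assumed scheme, not from the paper).

    dev_preds maps a model name to its predicted labels on a dev set.
    """
    n = len(dev_labels)
    sample_w = [1.0 / n] * n
    weights = {}
    remaining = set(dev_preds)
    while remaining:
        # Weighted error of each unused model under the current sample weights.
        errs = {
            m: sum(w for w, p, y in zip(sample_w, dev_preds[m], dev_labels) if p != y)
            for m in remaining
        }
        best = min(sorted(remaining), key=errs.get)  # ties broken by name
        err = min(max(errs[best], 1e-9), 1 - 1e-9)   # avoid log(0)
        alpha = math.log((1 - err) / err) + math.log(n_classes - 1)
        weights[best] = max(alpha, 0.0)  # worse-than-chance models get no vote
        # Up-weight the questions this model got wrong, then renormalise.
        sample_w = [
            w * math.exp(weights[best]) if p != y else w
            for w, p, y in zip(sample_w, dev_preds[best], dev_labels)
        ]
        total = sum(sample_w)
        sample_w = [w / total for w in sample_w]
        remaining.remove(best)
    return weights


def weighted_vote(answers, weights):
    """Return the answer label with the largest total model weight."""
    scores = defaultdict(float)
    for model, label in answers.items():
        scores[label] += weights.get(model, 0.0)
    return max(sorted(scores), key=scores.get)


def nearest_cluster(vec, centroids):
    """Index of the centroid closest to vec (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(vec, centroids[i])),
    )


def select_model(vec, centroids, cluster_acc):
    """Pick the model with the best dev accuracy in the question's cluster.

    cluster_acc[i] maps a model name to its dev accuracy within cluster i.
    """
    acc = cluster_acc[nearest_cluster(vec, centroids)]
    return max(sorted(acc), key=acc.get)
```

In this sketch, `weighted_vote({"m1": "A", "m2": "B", "m3": "B"}, weights)` can return "A" even though two models answered "B", provided `m1` carries more weight than the other two combined. `select_model` instead routes each question to whichever model performed best on similar development questions.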

Results: The Majority Weighted Vote and Dynamic Model Selection methods demonstrate superior performance compared to individual LLMs across three medical QA datasets. Specifically, the Majority Weighted Vote achieves accuracies of 35.84%, 96.21%, and 37.26% on MedMCQA, PubMedQA, and MedQA-USMLE, respectively. Dynamic Model Selection yields slightly higher accuracies of 38.01%, 96.36%, and 38.13% on the same datasets, gains of 2.17, 0.15, and 0.87 percentage points.

Conclusion: The LLM-Synergy framework, with its two ensemble methods, represents a significant advancement in leveraging LLMs for medical QA tasks. It provides an innovative way to make efficient use of developments in LLM technologies and can be customized for both existing and potential future challenge tasks in biomedical and health informatics research.

Keywords: Ensemble Learning; Healthcare AI; Large Language Models; Medical Question Answering.

Publication types

Preprint
