EYE-Llama, an in-domain large language model for ophthalmology

Tania Haghighi; Sina Gholami; Jared Todd Sokol; Enaika Kishnani; Adnan Ahsaniyan; Holakou Rahmanian; Fares Hedayati; Theodore Leng; Minhaj Nur Alam

doi:10.1101/2024.04.26.591355

bioRxiv [Preprint]. 2024 Apr 29:2024.04.26.591355. doi: 10.1101/2024.04.26.591355.

Authors

Tania Haghighi¹,², Sina Gholami¹, Jared Todd Sokol³, Enaika Kishnani¹, Adnan Ahsaniyan², Holakou Rahmanian², Fares Hedayati², Theodore Leng³, Minhaj Nur Alam¹

Affiliations

¹ Department of Electrical Engineering, University of North Carolina at Charlotte, Charlotte, NC, United States.
² Department of Computer Science, Baha'i Institute for Higher Education, Tehran, Iran.
³ Department of Ophthalmology, Stanford University School of Medicine, Stanford, CA, United States.

Abstract

Background: Training Large Language Models (LLMs) with in-domain data can significantly enhance their performance, leading to more accurate and reliable question-answering (QA) systems essential for supporting clinical decision-making and educating patients.

Methods: This study introduces LLMs trained on in-domain, well-curated ophthalmic datasets. We also present a substantial open-source ophthalmic language dataset for model training. Our LLMs (EYE-Llama) were first pre-trained on an ophthalmology-specific dataset comprising paper abstracts, textbooks, EyeWiki, and Wikipedia articles. The models were then fine-tuned on a diverse range of QA datasets. The LLMs at each stage were compared to baseline Llama 2, ChatDoctor, and ChatGPT (GPT-3.5) models on four distinct test sets, and were evaluated quantitatively (accuracy, F1 score, and BERTScore) and qualitatively by two ophthalmologists.
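As an illustration of the quantitative metrics named above, the following is a minimal sketch of how accuracy and F1 might be computed for multiple-choice QA predictions. It is not the authors' evaluation code: the abstract does not state which F1 variant was used, so macro-averaged F1 over answer labels is an assumption here, and the function names and toy answer lists are hypothetical. BERTScore is omitted because it requires an external model.

```python
def accuracy(gold, pred):
    """Fraction of predictions that exactly match the gold answer label."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-label F1 scores (assumed variant, see lead-in)."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); define it as 0 when the label never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: gold answer options and model-selected options.
gold_answers = ["A", "B", "C", "A", "D"]
model_answers = ["A", "B", "A", "A", "C"]

print(f"accuracy = {accuracy(gold_answers, model_answers):.2f}")
print(f"macro F1 = {macro_f1(gold_answers, model_answers):.2f}")
```

Macro averaging weights every answer option equally, which can differ noticeably from accuracy when some options are rare in the test set.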

Results: On the American Academy of Ophthalmology (AAO) test set, with BERTScore as the metric, our models surpassed both Llama 2 and ChatDoctor in F1 score and matched ChatGPT, which has 175 billion parameters (EYE-Llama: 0.57, Llama 2: 0.56, ChatDoctor: 0.56, and ChatGPT: 0.57). On the MedMCQA test set, the fine-tuned models achieved higher accuracy than the Llama 2 and ChatDoctor models (EYE-Llama: 0.39, Llama 2: 0.33, ChatDoctor: 0.29). However, ChatGPT outperformed EYE-Llama with an accuracy of 0.55. On the PubMedQA set, the fine-tuned model achieved higher accuracy than the Llama 2, ChatGPT, and ChatDoctor models (EYE-Llama: 0.96, Llama 2: 0.90, ChatGPT: 0.93, ChatDoctor: 0.92).

Conclusion: The study shows that pre-training and fine-tuning LLMs such as EYE-Llama enhances their performance in specific medical domains. Our EYE-Llama models surpass baseline Llama 2 in all evaluations, highlighting the effectiveness of specialized LLMs in medical QA systems. (Funded by NEI R15EY035804 (MNA) and UNC Charlotte Faculty Research Grant (MNA).)

Publication types

Preprint

Grants and funding

R15 EY035804/EY/NEI NIH HHS/United States