Binary acronym disambiguation in clinical notes from electronic health records with an application in computational phenotyping

Nicholas B Link; Sicong Huang; Tianrun Cai; Jiehuan Sun; Kumar Dahal; Lauren Costa; Kelly Cho; Katherine Liao; Tianxi Cai; Chuan Hong; Million Veteran Program

doi:10.1016/j.ijmedinf.2022.104753

Int J Med Inform. 2022 Apr 1:162:104753. doi: 10.1016/j.ijmedinf.2022.104753. Online ahead of print.

Affiliations

¹ VA Boston Healthcare System, Boston, MA, United States; Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, United States. Electronic address: nicklink@g.harvard.edu.
² VA Boston Healthcare System, Boston, MA, United States; Division of Rheumatology, Immunology, and Allergy, Brigham and Women's Hospital, Boston, MA, United States.
³ VA Boston Healthcare System, Boston, MA, United States; Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, United States.
⁴ VA Boston Healthcare System, Boston, MA, United States.

PMID: 35405530

Abstract

Objective: The use of electronic health record (EHR) systems has grown over the past decade, and with it the need to extract information from unstructured clinical narratives. Clinical notes, however, frequently contain acronyms with several potential senses (meanings), and traditional natural language processing (NLP) techniques cannot differentiate between these senses. In this study, we introduce a semi-supervised method for binary acronym disambiguation, the task of determining whether an acronym in a clinical EHR note carries a target sense.

Methods: We developed a semi-supervised ensemble machine learning algorithm (CASEml) that automatically identifies when an acronym means a target sense by leveraging semantic embeddings, visit-level text, and billing information. The algorithm was validated using note data from the Veterans Affairs hospital system to classify the meaning of three acronyms: RA, MS, and MI. We compared the performance of CASEml against another standard semi-supervised method and a baseline that selects the most frequent acronym sense. In addition to evaluating these methods on specific instances of acronyms, we evaluated the impact of acronym disambiguation on NLP-driven phenotyping of rheumatoid arthritis.
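
The most-frequent-sense baseline named in the Methods can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the labels are toy data (1 = target sense, 0 = any other sense), and the function names `fit_majority_baseline` and `accuracy` are hypothetical. This is not the CASEml algorithm, whose implementation the abstract does not describe.

```python
from collections import Counter


def fit_majority_baseline(train_labels):
    """Return the most frequent label in the labeled training mentions.

    Labels are binary: 1 means the acronym carries the target sense,
    0 means it carries some other sense.
    """
    if not train_labels:
        raise ValueError("train_labels must not be empty")
    return Counter(train_labels).most_common(1)[0][0]


def accuracy(predictions, gold):
    """Fraction of predictions that match the gold labels."""
    if len(predictions) != len(gold) or not gold:
        raise ValueError("predictions and gold must be non-empty and equal length")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


# Toy labels (hypothetical, not data from the study).
train_labels = [1, 1, 0, 1, 0, 1]
test_labels = [1, 0, 1, 1]

# The baseline ignores the note text and always predicts the majority sense.
majority = fit_majority_baseline(train_labels)
predictions = [majority] * len(test_labels)
baseline_accuracy = accuracy(predictions, test_labels)
print(majority, baseline_accuracy)
```

Because this baseline ignores context entirely, its accuracy equals the prevalence of the majority sense in the evaluation set, which is why it serves as a floor for any context-aware method.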

Results: CASEml achieved accuracies of 0.947, 0.911, and 0.706 for RA, MS, and MI, respectively, higher than a standard baseline metric and (on average) higher than a state-of-the-art semi-supervised method. We also demonstrated that applying CASEml to medical notes improves the AUC of a phenotyping algorithm for rheumatoid arthritis.

Conclusion: CASEml is a novel method that accurately disambiguates acronyms in clinical notes and has advantages over commonly used supervised and semi-supervised machine learning approaches. In addition, CASEml improves the performance of NLP tasks that rely on ambiguous acronyms, such as phenotyping.

Keywords: Acronym disambiguation; Electronic health records; Natural language processing; Predictive modeling; Semantic embedding; Unsupervised learning.