Learning to Identify Rare Disease Patients from Electronic Health Records

Rich Colbaugh, Kristin Glass, Christopher Rudolf, Mike Tremblay
Volv Global, Lausanne, Switzerland

AMIA Annu Symp Proc. 2018 Dec 5:2018:340-347. eCollection 2018.

PMID: 30815073
PMCID: PMC6371307

Abstract

There is increasing interest in developing prediction models capable of identifying rare disease patients in population-scale databases such as electronic health records (EHRs). Deriving these models is challenging for many reasons, perhaps the most important being the limited number of patients with 'gold standard' confirmed diagnoses from which to learn. This paper presents a novel cascade learning methodology which induces accurate prediction models from noisy 'silver standard' labeled data, that is, patients provisionally labeled as positive for the target disease based upon unconfirmed evidence. The algorithm combines unsupervised feature selection, supervised ensemble learning, and unsupervised clustering to enable robust learning from noisy labels. The efficacy of the approach is illustrated through a case study involving the detection of lipodystrophy patients in a country-scale database of EHRs. The case study demonstrates that our algorithm outperforms state-of-the-art prediction techniques and permits discovery of previously undiagnosed patients in large EHR databases.
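
The abstract names three stages (unsupervised feature selection, supervised ensemble learning, unsupervised clustering) but does not say how they are implemented or wired together. The sketch below is one possible toy arrangement, not the authors' algorithm. Everything in it is an assumption made for illustration: variance ranking as the feature selector, bagged nearest-centroid voters as the ensemble, 1-D 2-means on the vote fractions as the clustering step, the synthetic data, the 10% label-flip rate, and all function names.

```python
import random
from statistics import mean, pvariance


def select_features(X, k):
    """Unsupervised stage (assumed): keep the k highest-variance columns."""
    variances = [pvariance(col) for col in zip(*X)]
    ranked = sorted(range(len(variances)), key=lambda j: variances[j], reverse=True)
    return sorted(ranked[:k])


def project(X, cols):
    """Restrict every row to the selected columns."""
    return [[row[j] for j in cols] for row in X]


def centroid(rows):
    return [mean(col) for col in zip(*rows)]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def ensemble_scores(X, y, n_models=25, seed=0):
    """Supervised stage (assumed): bagged nearest-centroid voters trained on
    noisy labels. Returns the fraction of 'positive' votes per patient."""
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    votes = [0] * len(X)
    for _ in range(n_models):
        # Bootstrap each class separately so both are always represented.
        c_pos = centroid([X[i] for i in rng.choices(pos, k=len(pos))])
        c_neg = centroid([X[i] for i in rng.choices(neg, k=len(neg))])
        for i, row in enumerate(X):
            if sq_dist(row, c_pos) < sq_dist(row, c_neg):
                votes[i] += 1
    return [v / n_models for v in votes]


def two_means_1d(scores, iters=20):
    """Unsupervised stage (assumed): 1-D 2-means on the ensemble scores.
    Returns True for patients falling in the high-score cluster."""
    lo, hi = min(scores), max(scores)
    for _ in range(iters):
        high = [s for s in scores if abs(s - hi) <= abs(s - lo)]
        low = [s for s in scores if abs(s - hi) > abs(s - lo)]
        if not low or not high:
            break
        lo, hi = mean(low), mean(high)
    return [abs(s - hi) <= abs(s - lo) for s in scores]


def make_toy_data(seed=1, n=200, n_cases=40, flip_rate=0.10):
    """Synthetic stand-in for EHR features: two informative columns,
    three low-variance noise columns, and noisy 'silver' labels."""
    rng = random.Random(seed)
    X, truth = [], []
    for i in range(n):
        is_case = i < n_cases
        shift = 4.0 if is_case else 0.0
        row = [rng.gauss(shift, 1.0), rng.gauss(shift, 1.0)]
        row += [rng.gauss(0.0, 0.1) for _ in range(3)]
        X.append(row)
        truth.append(int(is_case))
    # Silver labels: the true label, flipped with probability flip_rate.
    silver = [t if rng.random() > flip_rate else 1 - t for t in truth]
    return X, truth, silver


def run_cascade(X, silver, k=2):
    """Chain the three stages and return one boolean flag per patient."""
    cols = select_features(X, k)
    scores = ensemble_scores(project(X, cols), silver)
    return two_means_1d(scores)


if __name__ == "__main__":
    X, truth, silver = make_toy_data()
    flags = run_cascade(X, silver)
    agree = sum(int(f) == t for f, t in zip(flags, truth)) / len(truth)
    print(f"agreement with true labels: {agree:.2f}")
```

On this toy data the true labels are known, so the flagged set can be compared against them directly. In a real EHR setting only the silver labels would be available, and any flagged patient would still need clinical confirmation.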

MeSH terms

Algorithms*
Area Under Curve
Cluster Analysis
Electronic Health Records*
Feasibility Studies
Humans
Information Storage and Retrieval / methods
Lipodystrophy / diagnosis*
Models, Biological
Rare Diseases / diagnosis*