A bootstrapping algorithm to improve cohort identification using structured data

Sasikiran Kandula; Qing Zeng-Treitler; Lingji Chen; William L Salomon; Bruce E Bray

doi:10.1016/j.jbi.2011.10.013

J Biomed Inform. 2011 Dec;44 Suppl 1:S63-S68. doi: 10.1016/j.jbi.2011.10.013. Epub 2011 Nov 7.

Authors

Sasikiran Kandula¹, Qing Zeng-Treitler², Lingji Chen³, William L Salomon⁴, Bruce E Bray¹

Affiliations

¹ Department of Biomedical Informatics, University of Utah, Salt Lake City, UT, United States.
² Department of Biomedical Informatics, University of Utah, Salt Lake City, UT, United States. Electronic address: qing.zeng@utah.edu.
³ Scientific Systems Company Inc., Woburn, MA, United States.
⁴ Clinical Metrics LLC, Poland, ME, United States.

PMID: 22079803
DOI: 10.1016/j.jbi.2011.10.013

Abstract

Cohort identification is an important step in conducting clinical research studies. Using ICD-9 codes to identify disease cohorts is a common approach that can yield satisfactory results in certain conditions; however, many use cases require more accurate methods. In this study, we propose a bootstrapping method that supplements ICD-9 codes with lab results, medications, etc. to build classification models that identify cohorts more accurately. The proposed method does not require prior information about the true class of the patients. We used the method to identify Diabetes Mellitus (DM) and Hyperlipidemia (HL) patient cohorts from a database of 800,000 patients. Evaluation results show that the method identified 11,000 patients who did not have DM-related ICD-9 codes as positive for DM and 52,000 patients without HL codes as positive for HL. A review of 400 patient charts (200 patients for each condition) by two clinicians shows that, for both conditions studied, the labeling assigned by the proposed approach is more consistent with that of the clinicians than labeling through ICD-9 codes. The method is reasonably automated and, we believe, holds potential for inexpensive, more accurate cohort identification.
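
The abstract does not specify the classification model or the features used, so the following is only a minimal sketch of the general bootstrapping idea, not the authors' implementation. It assumes a toy nearest-centroid classifier, two hypothetical features per patient (an HbA1c lab value and a medication flag), and made-up patient records. ICD-9 code presence supplies the initial labels, and the classifier is retrained on its own output until the labels stop changing.

```python
from math import dist


def centroid(rows):
    """Column-wise mean of a non-empty list of feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def bootstrap_labels(features, seed_labels, max_iter=10):
    """Refine noisy seed labels by repeatedly retraining on the current labels.

    features    -- list of numeric feature vectors, one per patient
    seed_labels -- list of booleans (True if the patient has a relevant ICD-9 code)
    No ground-truth labels are needed; the seed labels are the only supervision.
    """
    labels = list(seed_labels)
    for _ in range(max_iter):
        pos = [f for f, y in zip(features, labels) if y]
        neg = [f for f, y in zip(features, labels) if not y]
        if not pos or not neg:
            break  # one class is empty, so nothing further can be learned
        c_pos, c_neg = centroid(pos), centroid(neg)
        # Relabel every patient by the nearer class centroid.
        new_labels = [dist(f, c_pos) < dist(f, c_neg) for f in features]
        if new_labels == labels:
            break  # labels are stable
        labels = new_labels
    return labels


# Hypothetical data: [HbA1c (%), on glucose-lowering medication (0/1)]
features = [[8.5, 1], [9.0, 1], [7.9, 1], [5.2, 0], [5.0, 0], [5.4, 0]]
has_icd9_code = [True, True, False, False, False, False]

final = bootstrap_labels(features, has_icd9_code)
# Patients labeled positive despite having no ICD-9 code.
newly_identified = [i for i, (seed, y) in enumerate(zip(has_icd9_code, final))
                    if y and not seed]
print(final, newly_identified)
```

In this toy data, the third patient has no ICD-9 code but has lab and medication values that resemble the coded patients, which is the kind of case the abstract reports the method recovering. A real system would need feature scaling, a stronger model, and validation against chart review.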

Published by Elsevier Inc.

Publication types

Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

Algorithms*
Cohort Studies*
Databases, Factual*
Diabetes Mellitus / classification
Diabetes Mellitus / diagnosis
Humans
Hyperlipidemias / classification
Hyperlipidemias / diagnosis
International Classification of Diseases / standards*