Extracting and integrating data from entire electronic health records for detecting colorectal cancer cases

Hua Xu; Zhenming Fu; Anushi Shah; Yukun Chen; Neeraja B Peterson; Qingxia Chen; Subramani Mani; Mia A Levy; Qi Dai; Josh C Denny

AMIA Annu Symp Proc. 2011;2011:1564-72. Epub 2011 Oct 22.

Authors

Hua Xu¹, Zhenming Fu, Anushi Shah, Yukun Chen, Neeraja B Peterson, Qingxia Chen, Subramani Mani, Mia A Levy, Qi Dai, Josh C Denny

Affiliation

¹ Department of Biomedical Informatics, Vanderbilt University, School of Medicine, Nashville, TN, USA.

PMID: 22195222
PMCID: PMC3243156

Abstract

Identifying a cohort of patients with a specific disease is an important step in clinical research based on electronic health records (EHRs). Informatics approaches that combine structured EHR data, such as billing records, with narrative text data have demonstrated utility for such tasks. This paper describes an algorithm combining machine learning and natural language processing to detect patients with colorectal cancer (CRC) from entire EHRs at Vanderbilt University Hospital. We developed a general case detection method that consists of two steps: 1) extraction of positive CRC concepts from all clinical notes (document-level concept identification); and 2) determination of CRC cases using aggregated information from both clinical narratives and structured billing data (patient-level case determination). For each step, we compared the performance of rule-based and machine-learning-based approaches. Using a manually reviewed data set containing 300 possible CRC patients (150 for training and 150 for testing), we showed that our method achieved F-measures of 0.996 for document-level concept identification and 0.93 for patient-level case detection.
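
The two-step design can be illustrated with a minimal rule-based sketch in Python. Everything specific in it is an assumption made for illustration and is not taken from the paper: the term patterns, the negation cues, the ICD-9 prefixes 153 and 154, the min_mentions threshold, and the function names are hypothetical. The abstract does not state which rules or features the authors used, and this sketch does not cover their machine-learning variant.

```python
import re

# Hypothetical patterns (assumptions, not the authors' lexicon).
CRC_TERMS = re.compile(
    r"\b(colorectal|colon|rectal)\s+(cancer|carcinoma|adenocarcinoma)\b", re.I
)
NEGATION = re.compile(
    r"\b(no|denies|negative for|without|rule out|family history of)\b", re.I
)


def positive_crc_mentions(note_text):
    """Step 1 (document level): count CRC mentions in one note that are
    not preceded by a negation cue within the same sentence."""
    count = 0
    for sentence in re.split(r"(?<=[.!?])\s+", note_text):
        for match in CRC_TERMS.finditer(sentence):
            if not NEGATION.search(sentence[: match.start()]):
                count += 1
    return count


def is_crc_case(notes, billing_codes, min_mentions=2):
    """Step 2 (patient level): combine mention counts aggregated over all
    notes with structured billing data. The AND rule is an assumption."""
    mentions = sum(positive_crc_mentions(note) for note in notes)
    # Assumed ICD-9 prefixes for colon (153) and rectal (154) malignancies.
    has_code = any(code.startswith(("153", "154")) for code in billing_codes)
    return mentions >= min_mentions and has_code


def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    notes = [
        "Patient has colon cancer. Biopsy confirmed rectal adenocarcinoma.",
        "No evidence of colorectal cancer.",
    ]
    print(is_crc_case(notes, ["153.9"]))
```

In a machine-learning variant, the mention count and the billing-code indicator would instead serve as features for a classifier trained on the manually reviewed patients, rather than being combined by a fixed rule.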

Publication types

Comparative Study
Research Support, N.I.H., Extramural

MeSH terms

Algorithms*
Artificial Intelligence*
Colorectal Neoplasms / diagnosis*
Data Mining / methods*
Electronic Health Records*
Humans
Natural Language Processing
