Regular expression-based learning to extract bodyweight values from clinical notes

Maureen A Murtaugh; Bryan Smith Gibson; Doug Redd; Qing Zeng-Treitler

doi:10.1016/j.jbi.2015.02.009

J Biomed Inform. 2015 Apr:54:186-90. doi: 10.1016/j.jbi.2015.02.009. Epub 2015 Mar 5.

Authors

Maureen A Murtaugh¹, Bryan Smith Gibson², Doug Redd³, Qing Zeng-Treitler³

Affiliations

¹ IDEAS Center, Veterans Administration, Salt Lake City Health Care System, Salt Lake City, UT, United States; Department of Internal Medicine, University of Utah School of Medicine, Salt Lake City, UT, United States. Electronic address: Maureen.Murtaugh@hsc.utah.edu.
² IDEAS Center, Veterans Administration, Salt Lake City Health Care System, Salt Lake City, UT, United States; Department of Internal Medicine, University of Utah School of Medicine, Salt Lake City, UT, United States.
³ IDEAS Center, Veterans Administration, Salt Lake City Health Care System, Salt Lake City, UT, United States; Department of Biomedical Informatics, University of Utah School of Medicine, Salt Lake City, UT, United States.

PMID: 25746391
DOI: 10.1016/j.jbi.2015.02.009

Abstract

Background: Bodyweight-related measures (weight, height, BMI, abdominal circumference) are extremely important for clinical care, research, and quality improvement. These and other vital signs data are frequently missing from the structured tables of electronic health records. However, they are often recorded as text within clinical notes. In this project we sought to develop and validate a learning algorithm that would extract bodyweight-related measures from clinical notes in the Veterans Administration (VA) Electronic Health Record to complement the structured data used in clinical research.

Methods: We developed the Regular Expression Discovery Extractor (REDEx), a supervised learning algorithm that generates regular expressions from a training set. The regular expressions generated by REDEx were then used to extract the numerical values of interest. To train the algorithm we created a corpus of 268 outpatient primary care notes that were annotated by two annotators. This annotation served to develop the annotation process and to identify terms associated with bodyweight-related measures for training the supervised learning algorithm. Snippets from an additional 300 outpatient primary care notes were subsequently annotated independently by two reviewers to complete the training set. Inter-annotator agreement was calculated. REDEx was applied to a separate test set of 3561 notes to generate a dataset of weights extracted from text. We estimated the number of unique individuals who would otherwise not have bodyweight-related measures recorded in the VA Corporate Data Warehouse (CDW) and the number of additional bodyweight-related measures that would be captured.
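To illustrate the extraction step described in the Methods, the following is a minimal Python sketch that applies a single regular expression to note text and returns the numeric weight values it finds. It is not the REDEx algorithm: REDEx learns its expressions from annotated snippets, whereas the pattern here is hand-written for illustration. The names WEIGHT_PATTERN and extract_weights, the accepted keywords, and the unit list are all assumptions.

```python
import re
from typing import List, Tuple

# Hypothetical hand-written pattern (an assumption for illustration only).
# REDEx would generate expressions like this from annotated training snippets.
WEIGHT_PATTERN = re.compile(
    r"\b(?:weight|wt)\b"          # keyword that signals a weight mention
    r"\s*[:=]?\s*"                # optional separator such as ':' or '='
    r"(\d{2,3}(?:\.\d+)?)"        # numeric value, e.g. 185 or 185.5
    r"\s*(lbs?|pounds|kg)\b",     # unit (assumed list, not exhaustive)
    re.IGNORECASE,
)


def extract_weights(note: str) -> List[Tuple[float, str]]:
    """Return (value, unit) pairs for every weight mention matched in a note."""
    return [
        (float(value), unit.lower())
        for value, unit in WEIGHT_PATTERN.findall(note)
    ]


if __name__ == "__main__":
    sample = "Vitals: BP 120/80. Weight: 185.5 lbs. Ht 70 in. Weight loss discussed."
    print(extract_weights(sample))
```

Requiring both a keyword and a unit keeps the sketch from matching phrases such as "weight loss discussed", at the cost of missing mentions that omit the unit.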

Results: REDEx's performance was: accuracy=98.3%, precision=98.8%, recall=98.3%, F=98.5%. In the dataset of weights from 3561 notes, 7.7% of notes contained bodyweight-related measures that were not available as structured data. In addition, 2 more bodyweight-related measures were identified per individual per year.
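The reported F value is consistent with the standard harmonic mean of precision and recall. The short Python sketch below shows that relationship, assuming the abstract's F is the balanced F1 measure (the abstract does not state the weighting). The function names and the example counts are hypothetical; only the 98.8% and 98.3% figures come from the abstract.

```python
from typing import Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Compute precision and recall from true-positive, false-positive,
    and false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall


def f_measure(precision: float, recall: float) -> float:
    """Balanced F measure (F1): the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Precision and recall as reported in the abstract.
    f = f_measure(0.988, 0.983)
    print(f"F = {f:.1%}")

    # Hypothetical counts, for illustration only (not from the study).
    p, r = precision_recall(tp=8, fp=2, fn=2)
    print(p, r)
```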

Conclusion: Bodyweight-related measures are frequently stored as text in clinical notes. A supervised learning algorithm can be used to extract these data. Implications for clinical care, epidemiology, and quality improvement efforts are discussed.

Keywords: Bodyweight; Natural language processing; Text classification.

Publication types

Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

Body Weight*
Data Curation
Data Mining / methods*
Electronic Health Records*
Humans
Natural Language Processing*
Reproducibility of Results