Regular expression-based learning to extract bodyweight values from clinical notes

J Biomed Inform. 2015 Apr:54:186-90. doi: 10.1016/j.jbi.2015.02.009. Epub 2015 Mar 5.

Abstract

Background: Bodyweight related measures (weight, height, BMI, abdominal circumference) are extremely important for clinical care, research and quality improvement. These and other vitals signs data are frequently missing from structured tables of electronic health records. However they are often recorded as text within clinical notes. In this project we sought to develop and validate a learning algorithm that would extract bodyweight related measures from clinical notes in the Veterans Administration (VA) Electronic Health Record to complement the structured data used in clinical research.

Methods: We developed the Regular Expression Discovery Extractor (REDEx), a supervised learning algorithm that generates regular expressions from a training set. The regular expressions generated by REDEx were then used to extract the numerical values of interest. To train the algorithm we created a corpus of 268 outpatient primary care notes that were annotated by two annotators. This annotation served to develop the annotation process and identify terms associated with bodyweight related measures for training the supervised learning algorithm. Snippets from an additional 300 outpatient primary care notes were subsequently annotated independently by two reviewers to complete the training set. Inter-annotator agreement was calculated. REDEx was applied to a separate test set of 3561 notes to generate a dataset of weights extracted from text. We estimated the number of unique individuals who would otherwise not have bodyweight related measures recorded in the CDW and the number of additional bodyweight related measures that would be additionally captured.

Results: REDEx's performance was: accuracy=98.3%, precision=98.8%, recall=98.3%, F=98.5%. In the dataset of weights from 3561 notes, 7.7% of notes contained bodyweight related measures that were not available as structured data. In addition 2 additional bodyweight related measures were identified per individual per year.

Conclusion: Bodyweight related measures are frequently stored as text in clinical notes. A supervised learning algorithm can be used to extract this data. Implications for clinical care, epidemiology, and quality improvement efforts are discussed.

Keywords: Bodyweight; Natural language processing; Text classification.

Publication types

  • Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

  • Body Weight*
  • Data Curation
  • Data Mining / methods*
  • Electronic Health Records*
  • Humans
  • Natural Language Processing*
  • Reproducibility of Results