A Computable Phenotype for Acute Respiratory Distress Syndrome Using Natural Language Processing and Machine Learning

Majid Afshar; Cara Joyce; Anthony Oakey; Perry Formanek; Philip Yang; Matthew M Churpek; Richard S Cooper; Susan Zelisko; Ron Price; Dmitriy Dligach

AMIA Annu Symp Proc. 2018 Dec 5;2018:157-165. eCollection 2018.

Authors

Majid Afshar¹,², Cara Joyce², Anthony Oakey³, Perry Formanek⁴, Philip Yang⁴, Matthew M Churpek⁵, Richard S Cooper², Susan Zelisko⁶, Ron Price⁶, Dmitriy Dligach²,³

Affiliations

¹ Division of Pulmonary and Critical Care Medicine, Loyola University Medical Center, Maywood, IL.
² Department of Public Health Sciences, Stritch School of Medicine, Loyola University Chicago, Maywood, IL.
³ Department of Computer Science, Loyola University Chicago, Chicago, IL.
⁴ Department of Medicine, Loyola University Medical Center, Maywood, IL.
⁵ Division of Pulmonary and Critical Care Medicine, University of Chicago, Chicago, IL.
⁶ Informatics and Systems Development, Health Sciences Division, Loyola University Chicago, Maywood, IL.

PMID: 30815053
PMCID: PMC6371271

Abstract

Acute Respiratory Distress Syndrome (ARDS) is a syndrome of respiratory failure that may be identified using text from radiology reports. The objective of this study was to determine whether natural language processing (NLP) with machine learning performs better than a traditional keyword model for ARDS identification. Linguistic pre-processing of reports was performed, and text features were used as inputs to machine learning classifiers that were tuned using 10-fold cross-validation on 80% of the sample and tested on the remaining 20%. A cohort of 533 patients was evaluated, with a data corpus of 9,255 radiology reports. The traditional model had an accuracy of 67.3% (95% CI: 58.3-76.3) with a positive predictive value (PPV) of 41.7% (95% CI: 27.7-55.6). The best NLP model had an accuracy of 83.0% (95% CI: 75.9-90.2) with a PPV of 71.4% (95% CI: 52.1-90.8). A computable phenotype for ARDS with NLP may identify more cases than the traditional model.
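
For readers who want to see how an accuracy or PPV with a 95% CI of the kind reported above can be derived from a confusion matrix, the following is a minimal sketch. It assumes a normal-approximation (Wald) interval; the abstract does not state which interval method the authors used. The function names and the counts are hypothetical and for illustration only; they are not the study's data.

```python
import math


def proportion_with_wald_ci(successes, total, z=1.96):
    """Return (estimate, lower, upper) for a proportion.

    Uses the normal-approximation (Wald) interval, which is an
    assumption here; the source does not name its CI method.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    # Clamp to [0, 1] because the Wald interval can overshoot the bounds.
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


def accuracy_and_ppv(tp, fp, tn, fn):
    """Compute accuracy and PPV, each with a 95% Wald interval."""
    total = tp + fp + tn + fn
    return {
        # Accuracy: correctly classified cases over all cases.
        "accuracy": proportion_with_wald_ci(tp + tn, total),
        # PPV: true positives over all predicted positives.
        "ppv": proportion_with_wald_ci(tp, tp + fp),
    }


# Hypothetical counts for illustration only; not taken from the study.
metrics = accuracy_and_ppv(tp=15, fp=5, tn=70, fn=10)
for name, (estimate, lower, upper) in metrics.items():
    print(f"{name}: {estimate:.1%} (95% CI: {lower:.1%}-{upper:.1%})")
```

PPV is computed over predicted positives only, so its denominator is smaller than that of accuracy and its interval is correspondingly wider, which matches the pattern in the reported results.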

MeSH terms

Adult
Aged
Area Under Curve
Cohort Studies
Diagnosis, Computer-Assisted
Electronic Health Records*
Female
Humans
Length of Stay
Male
Middle Aged
Natural Language Processing*
Predictive Value of Tests
Radiography, Thoracic*
Respiratory Distress Syndrome / diagnosis*
Risk Factors
Supervised Machine Learning*
Unified Medical Language System

Grants and funding

K23 AA024503/AA/NIAAA NIH HHS/United States