Democratizing EHR analyses with FIDDLE: a flexible data-driven preprocessing pipeline for structured clinical data

Shengpu Tang; Parmida Davarmanesh; Yanmeng Song; Danai Koutra; Michael W Sjoding; Jenna Wiens

doi:10.1093/jamia/ocaa139

J Am Med Inform Assoc. 2020 Dec 9;27(12):1921-1934. doi: 10.1093/jamia/ocaa139.

Authors

Shengpu Tang¹, Parmida Davarmanesh², Yanmeng Song³, Danai Koutra¹, Michael W Sjoding⁴,⁵,⁶,⁷, Jenna Wiens¹,⁵,⁶

Affiliations

¹ Department of Electrical Engineering and Computer Science, Division of Computer Science and Engineering, University of Michigan, Ann Arbor, USA.
² Department of Mathematics, University of Michigan, Ann Arbor, USA.
³ Department of Statistics, University of Michigan, Ann Arbor, USA.
⁴ Department of Internal Medicine, University of Michigan, Ann Arbor, USA.
⁵ Institution for Healthcare Policy & Innovation, University of Michigan, Ann Arbor, USA.
⁶ Michigan Integrated Center for Health Analytics and Medical Prediction, University of Michigan, Ann Arbor, USA.
⁷ Center for Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, USA.

Abstract

Objective: In applying machine learning (ML) to electronic health record (EHR) data, many decisions must be made before any ML is applied; such preprocessing is labor-intensive. As the role of ML in health care grows, there is an increasing need for systematic and reproducible preprocessing techniques for EHR data. Thus, we developed FIDDLE (Flexible Data-Driven Pipeline), an open-source framework that streamlines the preprocessing of data extracted from the EHR.

Materials and methods: Largely data-driven, FIDDLE systematically transforms structured EHR data into feature vectors, limiting the number of decisions a user must make while incorporating good practices from the literature. To demonstrate its utility and flexibility, we conducted a proof-of-concept experiment in which we applied FIDDLE to 2 publicly available EHR data sets collected from intensive care units: MIMIC-III and the eICU Collaborative Research Database. We trained different ML models to predict 3 clinically important outcomes: in-hospital mortality, acute respiratory failure, and shock. We evaluated the models using the area under the receiver operating characteristic curve (AUROC) and compared their performance to several baselines.
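To make the idea of turning structured EHR data into feature vectors concrete, the following is a minimal sketch and not FIDDLE's actual implementation. It assumes, hypothetically, that records arrive as (patient, time in hours, variable, value) tuples, that the latest value in each fixed-width time bin is kept, and that each variable gets a companion missingness flag. The function name `bin_records` and its parameters are illustrative only.

```python
from collections import defaultdict


def bin_records(records, horizon_hours, bin_hours):
    """Turn long-format records into per-patient, per-bin feature rows.

    Hypothetical illustration: each row holds, for every variable in sorted
    order, the latest value seen in that bin followed by a missingness flag
    (1 if the variable was not observed in the bin, else 0).
    """
    n_bins = int(horizon_hours // bin_hours)
    variables = sorted({var for _, _, var, _ in records})

    # (patient, bin index) -> {variable: (time, value)} keeping the latest reading
    latest = defaultdict(dict)
    for pid, t, var, value in records:
        if not 0 <= t < horizon_hours:
            continue  # ignore readings outside the observation window
        b = int(t // bin_hours)
        prev = latest[(pid, b)].get(var)
        if prev is None or t >= prev[0]:
            latest[(pid, b)][var] = (t, value)

    features = {}
    for pid in sorted({r[0] for r in records}):
        rows = []
        for b in range(n_bins):
            row = []
            for var in variables:
                obs = latest.get((pid, b), {}).get(var)
                # Unobserved values are filled with 0.0 (an assumed placeholder)
                row += [obs[1] if obs else 0.0, 0 if obs else 1]
            rows.append(row)
        features[pid] = rows
    return variables, features


records = [
    ("p1", 0.5, "hr", 80.0),
    ("p1", 0.9, "hr", 90.0),
    ("p1", 2.5, "sbp", 120.0),
]
variables, features = bin_records(records, horizon_hours=4, bin_hours=2)
print(variables)
print(features["p1"])
```

Each patient ends up with one row per time bin, so the output can be flattened or stacked into the fixed-size input that most ML models expect. The choice to fill unobserved values with 0.0 is an assumption made here for brevity.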

Results: Across tasks, FIDDLE extracted between 2,528 and 7,403 features from MIMIC-III and eICU. On all tasks, FIDDLE-based models achieved good discriminative performance, with AUROCs of 0.757-0.886, comparable to the performance of MIMIC-Extract, a preprocessing pipeline designed specifically for MIMIC-III. Furthermore, our results showed that FIDDLE is generalizable across different prediction times, ML algorithms, and data sets, while being relatively robust to different settings of user-defined arguments.
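As a reading aid for the AUROC values above: the AUROC equals the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case, with ties counted as one half. The sketch below computes it directly from that definition; the function name `auroc` and the toy labels and scores are illustrative assumptions, not data or code from the study.

```python
def auroc(labels, scores):
    """Pairwise (Mann-Whitney) estimate of the area under the ROC curve.

    labels: iterable of 0/1 outcomes; scores: predicted risk for each case.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative case")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive case ranked above negative case
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Toy example: 2 negatives, 2 positives
print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation. The quadratic loop is used for clarity; rank-based implementations are preferable for large cohorts.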

Conclusions: FIDDLE, an open-source preprocessing pipeline, facilitates applying ML to structured EHR data. By accelerating and standardizing labor-intensive preprocessing, FIDDLE can help stimulate progress in building clinically useful ML tools for EHR data.

Keywords: electronic health records; machine learning; preprocessing pipeline.

Publication types

Research Support, N.I.H., Extramural
Research Support, Non-U.S. Gov't
Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

Data Management
Data Mining*
Databases, Factual
Electronic Health Records*
Hospital Mortality
Humans
Intensive Care Units
Machine Learning*
ROC Curve
Respiratory Insufficiency
Risk Assessment
Shock
