Leveraging electronic health records for data science: common pitfalls and how to avoid them

Christopher M Sauer; Li-Ching Chen; Stephanie L Hyland; Armand Girbes; Paul Elbers; Leo A Celi

doi:10.1016/S2589-7500(22)00154-6

Lancet Digit Health. 2022 Dec;4(12):e893-e898. doi: 10.1016/S2589-7500(22)00154-6. Epub 2022 Sep 22.

Authors

Christopher M Sauer¹, Li-Ching Chen², Stephanie L Hyland³, Armand Girbes⁴, Paul Elbers⁴, Leo A Celi⁵

Affiliations

¹ Laboratory for Critical Care Computational Intelligence, Department of Intensive Care Medicine, Amsterdam Medical Data Science, Amsterdam Cardiovascular Science, Amsterdam Institute for Infection and Immunity, Amsterdam UMC, Location VUmc, Amsterdam, Netherlands; Laboratory for Computational Physiology, Institute for Medical Engineering & Science, Massachusetts Institute of Technology, Cambridge, MA, USA. Electronic address: sauerc@mit.edu.
² Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan.
³ Microsoft Research, Cambridge, UK.
⁴ Laboratory for Critical Care Computational Intelligence, Department of Intensive Care Medicine, Amsterdam Medical Data Science, Amsterdam Cardiovascular Science, Amsterdam Institute for Infection and Immunity, Amsterdam UMC, Location VUmc, Amsterdam, Netherlands.
⁵ Laboratory for Computational Physiology, Institute for Medical Engineering & Science, Massachusetts Institute of Technology, Cambridge, MA, USA; Department of Biostatistics, Harvard T H Chan School of Public Health, Boston, MA, USA; Division of Pulmonary, Critical Care and Sleep Medicine, Beth Israel Deaconess Medical Center, Boston, MA, USA.

PMID: 36154811

Abstract

Analysis of electronic health records (EHRs) is an increasingly common approach for studying real-world patient data. Use of routinely collected data offers several advantages compared with other study designs, including reduced administrative costs, the ability to update analyses as practice patterns evolve, and larger sample sizes. Methodologically, EHR analysis is subject to distinct challenges because data are not collected for research purposes. In this Viewpoint, we elaborate on the importance of in-depth knowledge of clinical workflows and describe six potential pitfalls to be avoided when working with EHR data, drawing on examples from the literature and our experience. We propose solutions for prevention or mitigation of factors associated with each of these six pitfalls: sample selection bias, imprecise variable definitions, limitations to deployment, variable measurement frequency, subjective treatment allocation, and model overfitting. Ultimately, we hope that this Viewpoint will guide researchers to further improve the methodological robustness of EHR analysis.
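
As a rough illustration of one of the pitfalls named above, variable measurement frequency, the following minimal Python sketch counts how often a laboratory value is recorded per patient. The records, the patient identifiers, and the helper names `measurement_counts` and `mean_value` are hypothetical and invented for this example; they are not taken from the Viewpoint. The sketch only shows that the number of measurements can differ between patients and may itself carry information, because clinicians often order tests more frequently for sicker patients.

```python
from collections import Counter

# Hypothetical toy records: (patient_id, hour_since_admission, lactate_mmol_per_l).
# All values are invented for illustration and do not come from the Viewpoint.
records = [
    ("A", 0, 1.1),
    ("A", 24, 1.0),
    ("B", 0, 3.9),
    ("B", 2, 4.4),
    ("B", 4, 4.1),
    ("B", 8, 3.2),
]


def measurement_counts(rows):
    """Count how many measurements each patient has."""
    return Counter(patient_id for patient_id, _, _ in rows)


def mean_value(rows, patient_id):
    """Mean of the recorded values for one patient."""
    values = [value for pid, _, value in rows if pid == patient_id]
    return sum(values) / len(values)


counts = measurement_counts(records)
for pid in sorted(counts):
    # Print the patient, the number of measurements, and the mean value.
    print(pid, counts[pid], round(mean_value(records, pid), 2))
```

In this toy data the patient with the higher values is also measured more often, so a model given only per-patient averages would miss the signal contained in the sampling pattern. Whether and how to encode measurement frequency in a real analysis depends on the clinical workflow that generated the data.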

Publication types

Review
Research Support, N.I.H., Extramural

MeSH terms

Data Collection
Data Science*
Electronic Health Records*
Humans
Research Design
Routinely Collected Health Data

Grants and funding

R01 EB017205/EB/NIBIB NIH HHS/United States