A multi-step approach to managing missing data in time and patient variant electronic health records

BMC Res Notes. 2022 Feb 17;15(1):64. doi: 10.1186/s13104-022-05911-w.

Abstract

Objective: Electronic health records (EHR) hold promise for conducting large-scale analyses linking individual characteristics to health outcomes. However, these data often contain many missing values at both the patient and visit level because data collection varies by facility, provider, and clinical need. This study proposes a stepwise framework for imputing missing values within a visit-level EHR dataset. The framework combines informative missingness with conditional imputation in a scalable manner and may be parallelized for efficiency.
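As a rough illustration of how informative missingness and conditional imputation can be combined step by step, the sketch below works through one variable in a toy visit-level dataset. It is not the authors' implementation: the abstract does not specify the individual steps, so the three steps here (flag missingness, carry the last observed value forward within a patient, then fill from the median of originally observed values in the same group) are assumptions. The field names (`patient_id`, `visit_date`, `facility`, `weight_kg`) and the function names are hypothetical as well.

```python
from collections import defaultdict
from statistics import median


def add_missing_indicator(visits, var):
    """Step 1 (informative missingness): record whether `var` was observed."""
    for v in visits:
        v[f"{var}_missing"] = v.get(var) is None


def fill_within_patient(visits, var):
    """Step 2 (assumed): carry the last observed value forward per patient."""
    by_patient = defaultdict(list)
    for v in visits:
        by_patient[v["patient_id"]].append(v)
    for rows in by_patient.values():
        rows.sort(key=lambda r: r["visit_date"])  # ISO dates sort as strings
        last = None
        for r in rows:
            if r.get(var) is None:
                if last is not None:
                    r[var] = last
            else:
                last = r[var]


def fill_conditional_median(visits, var, group_var):
    """Step 3 (assumed): fill from the median of originally observed values
    among visits that share the same `group_var` value."""
    observed = defaultdict(list)
    for v in visits:
        if not v[f"{var}_missing"]:
            observed[v[group_var]].append(v[var])
    for v in visits:
        if v.get(var) is None and observed[v[group_var]]:
            v[var] = median(observed[v[group_var]])


def impute(visits, var, group_var):
    """Run the three steps in order for a single variable."""
    add_missing_indicator(visits, var)
    fill_within_patient(visits, var)
    fill_conditional_median(visits, var, group_var)


# Toy data: five visits from three patients at two facilities.
visits = [
    {"patient_id": 1, "visit_date": "2015-01-10", "facility": "A", "weight_kg": 60.0},
    {"patient_id": 1, "visit_date": "2015-03-12", "facility": "A", "weight_kg": None},
    {"patient_id": 2, "visit_date": "2015-02-01", "facility": "A", "weight_kg": None},
    {"patient_id": 2, "visit_date": "2015-04-01", "facility": "A", "weight_kg": 70.0},
    {"patient_id": 3, "visit_date": "2015-02-20", "facility": "B", "weight_kg": None},
]
impute(visits, "weight_kg", "facility")
```

Because each variable is processed independently in this sketch, calls to `impute` for different variables could be distributed across workers, which is one way a procedure like this might be parallelized. Values with no observed data in their group remain missing, while the indicator column still records that they were originally absent.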

Results: We use a subset of AMPATH data covering 530,812 clinic visits by 16,316 Human Immunodeficiency Virus (HIV) positive women across Western Kenya who have given birth. We apply this process to a set of 84 clinical, social, and economic variables and impute values for 84.6% of the variables with missing data, reducing missing data by approximately 35.6% on average. We validate the imputed dataset by predicting National Hospital Insurance Fund (NHIF) enrollment with 94.8% accuracy.

Keywords: Big data; Electronic medical records; HIV; Imputation.

MeSH terms

  • Data Collection
  • Electronic Health Records*
  • Female
  • Humans
  • Kenya