Feature selection for unbiased imputation of missing values: A case study in healthcare

Annu Int Conf IEEE Eng Med Biol Soc. 2021 Nov;2021:1911-1915. doi: 10.1109/EMBC46164.2021.9630762.

Abstract

Datasets in healthcare are plagued with incomplete information. Imputation is a common method for dealing with missing data: the basic idea is to substitute a reasonable guess for each missing value and then continue with the analysis as if there were no missing data. However, unbiased predictions based on imputed datasets can only be guaranteed when the missingness mechanism is completely independent of both the observed and the missing data. Often, this assumption is violated in healthcare dataset acquisition due to unintentional errors or response bias of the interviewees. We highlight this issue through an extensive study of infant mortality prediction on an annual health survey dataset and provide systematic testing of this assumption. We identify such biased features using an empirical approach and show the impact of wrongful inclusion of these features on predictive performance.

Clinical relevance: We show that blind analysis combined with plug-and-play imputation of healthcare data is a potential pitfall that clinicians and researchers should avoid when searching for important markers of disease.
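As an illustration of what a check on the independence assumption might look like, the following minimal Python sketch tests whether a feature's missingness is associated with a binary outcome, using a 2x2 chi-square test. It is not the procedure used in the paper, which the abstract does not specify. The function names, the use of `None` to mark missing entries, and the 0.05 threshold are all assumptions made for this example.

```python
import math


def missingness_outcome_test(values, outcome):
    """Chi-square test (1 degree of freedom) of association between
    a feature's missingness indicator and a binary outcome.

    values  -- list of feature values, with None marking a missing entry
               (hypothetical encoding chosen for this sketch)
    outcome -- list of 0/1 labels of the same length
    Returns (chi2_statistic, p_value).
    """
    if len(values) != len(outcome):
        raise ValueError("values and outcome must have the same length")

    # counts[m][y]: m = 1 if the value is missing, y = outcome label
    counts = [[0, 0], [0, 0]]
    for v, y in zip(values, outcome):
        counts[int(v is None)][int(y)] += 1

    n = len(values)
    row = [sum(counts[0]), sum(counts[1])]
    col = [counts[0][0] + counts[1][0], counts[0][1] + counts[1][1]]
    if 0 in row or 0 in col:
        # No variation in missingness or in the outcome: test is uninformative.
        return 0.0, 1.0

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n
            chi2 += (counts[i][j] - expected) ** 2 / expected

    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


def flag_suspect_features(table, outcome, alpha=0.05):
    """Return names of features whose missingness is associated with the
    outcome at significance level alpha (an assumed threshold)."""
    return [
        name
        for name, values in table.items()
        if missingness_outcome_test(values, outcome)[1] < alpha
    ]
```

A significant result only indicates that missingness is related to the observed outcome. A non-significant result does not establish that data are missing completely at random, and a test of this kind cannot detect dependence on the unobserved values themselves.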

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Delivery of Health Care*
  • Humans
  • Research Design*