Model development including interactions with multiple imputed data

BMC Med Res Methodol. 2014 Dec 19;14:136. doi: 10.1186/1471-2288-14-136.

Abstract

Background: Multiple imputation is a reliable tool to deal with missing data and is becoming increasingly popular in biostatistics. However, building a model with interactions that are not specified a priori, in the presence of missing data, presents a challenge. On the one hand, the interactions are needed to impute the data, while on the other hand, the data is needed to identify the interactions. The objective of this study was to present a way in which this challenge can be addressed.

Methods: This paper investigates two strategies in which model development with interactions is achieved using a single data set generated from the Expectation Maximization (EM) algorithm. Imputation using both the fully conditional specification approach and the multivariate normal approach is carried out and results are compared. The strategies are illustrated with data from a study of ambient pollution and childhood asthma in Durban, South Africa.
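To make the idea of a single EM-generated data set concrete, the following is a minimal sketch and not the authors' implementation. It assumes a toy bivariate normal setting in which `x` is fully observed and `y` has missing entries. The function name `em_impute_bivariate` and the example values are hypothetical. The study itself involved many, mainly categorical, variables.

```python
def em_impute_bivariate(x, y, n_iter=200):
    """Fill missing y (None) by EM under a bivariate normal model.

    Assumption for this sketch: x is fully observed, so its mean and
    variance are fixed and only the y-related moments are iterated.
    """
    n = len(x)
    obs = [i for i in range(n) if y[i] is not None]
    mu_x = sum(x) / n
    s_xx = sum((v - mu_x) ** 2 for v in x) / n
    # Starting values taken from the observed y values only.
    mu_y = sum(y[i] for i in obs) / len(obs)
    s_yy = sum((y[i] - mu_y) ** 2 for i in obs) / len(obs)
    s_xy = 0.0
    for _ in range(n_iter):
        beta = s_xy / s_xx
        resid_var = s_yy - beta * s_xy
        # E-step: expected y and y^2 given x for the missing entries.
        ey = [y[i] if y[i] is not None else mu_y + beta * (x[i] - mu_x)
              for i in range(n)]
        ey2 = [ey[i] ** 2 + (0.0 if y[i] is not None else resid_var)
               for i in range(n)]
        # M-step: update the moments from the expected sufficient statistics.
        mu_y = sum(ey) / n
        s_yy = sum(ey2) / n - mu_y ** 2
        s_xy = sum(x[i] * ey[i] for i in range(n)) / n - mu_x * mu_y
    beta = s_xy / s_xx
    # The single completed data set uses the conditional means at convergence.
    return [y[i] if y[i] is not None else mu_y + beta * (x[i] - mu_x)
            for i in range(n)]


# Hypothetical example: the observed pairs lie exactly on y = 2x + 1.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 3.0, None, 7.0, None, 11.0]
completed = em_impute_bivariate(x, y)
```

A data set completed in this way carries no imputation uncertainty, so in this sketch it would serve only for choosing variables and interactions, with inference left to the multiply imputed data.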

Results: The different approaches to model building and imputation yielded similar results, even though the data were mainly categorical. With the multivariate normal imputed data, both model-building strategies identified the identical set of variables and interactions, while the models built using data imputed by fully conditional specification differed marginally between the two strategies. For both imputation approaches, model building with backward elimination applied to the initial EM data set was easier to implement and produced good results compared with a complete case analysis.

Conclusions: A predictive model that includes interactions can readily be developed from data with missing values by identifying significant interactions and then applying backward elimination to a single data set imputed with the EM algorithm. It is hoped that this idea can be developed further and that addressing this practical dilemma will increase the adoption of multiple imputation in medical research when data are incomplete.
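Backward elimination over main effects and candidate interactions on one completed data set can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in the paper. It fits ordinary least squares rather than the model family of the study, uses |t| < 2 as a stand-in for a significance test, and keeps a main effect for as long as an interaction containing it remains. All names (`invert`, `t_stats`, `backward_eliminate`, the `a:c` term labels) and the simulated data are hypothetical.

```python
import random


def invert(a):
    """Gauss-Jordan inverse of a small square matrix (list of lists)."""
    n = len(a)
    m = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))  # partial pivoting
        m[c], m[p] = m[p], m[c]
        piv = m[c][c]
        m[c] = [v / piv for v in m[c]]
        for r in range(n):
            if r != c:
                f = m[r][c]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[c])]
    return [row[n:] for row in m]


def t_stats(cols, y):
    """OLS t statistics for each column; an intercept is added internally."""
    n = len(y)
    X = [[1.0] + [col[i] for col in cols] for i in range(n)]
    k = len(X[0])
    xtx = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(k)]
           for a in range(k)]
    xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(k)]
    inv = invert(xtx)
    beta = [sum(inv[a][b] * xty[b] for b in range(k)) for a in range(k)]
    rss = sum((y[i] - sum(X[i][a] * beta[a] for a in range(k))) ** 2
              for i in range(n))
    s2 = rss / (n - k)
    return [beta[a] / (s2 * inv[a][a]) ** 0.5 for a in range(1, k)]


def backward_eliminate(terms, y, t_crit=2.0):
    """Drop the weakest removable term until all remaining terms pass t_crit.

    terms maps a name to a data column; interactions are named 'a:b'.
    """
    kept = dict(terms)
    while kept:
        names = list(kept)
        ts = dict(zip(names, t_stats([kept[n] for n in names], y)))
        # A main effect is protected while a kept interaction contains it.
        protected = {p for n in names if ":" in n for p in n.split(":")}
        candidates = [n for n in names
                      if n not in protected and abs(ts[n]) < t_crit]
        if not candidates:
            break
        del kept[min(candidates, key=lambda n: abs(ts[n]))]
    return list(kept)


# Simulated stand-in for a completed (EM-imputed) data set.
random.seed(0)
n = 200
a = [random.gauss(0, 1) for _ in range(n)]
b = [random.gauss(0, 1) for _ in range(n)]
c = [random.gauss(0, 1) for _ in range(n)]
y = [1 + 2 * a[i] + c[i] + 1.5 * a[i] * c[i] + random.gauss(0, 0.5)
     for i in range(n)]
terms = {"a": a, "b": b, "c": c,
         "a:b": [a[i] * b[i] for i in range(n)],
         "a:c": [a[i] * c[i] for i in range(n)],
         "b:c": [b[i] * c[i] for i in range(n)]}
selected = backward_eliminate(terms, y)
```

In this simulation the strong `a:c` interaction is retained, which in turn protects the main effects `a` and `c`. Whether the noise terms are all removed depends on the random draw and the chosen threshold.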

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Asthma / chemically induced
  • Asthma / epidemiology*
  • Biomedical Research / statistics & numerical data*
  • Data Interpretation, Statistical*
  • Environmental Exposure / adverse effects*
  • Feeding Behavior
  • Female
  • Humans
  • Male
  • Models, Statistical
  • Research Design
  • South Africa
  • Tobacco Smoke Pollution / adverse effects*

Substances

  • Tobacco Smoke Pollution