Handling missing values in healthcare data: A systematic review of deep learning-based imputation techniques

Mingxuan Liu; Siqi Li; Han Yuan; Marcus Eng Hock Ong; Yilin Ning; Feng Xie; Seyed Ehsan Saffari; Yuqing Shang; Victor Volovici; Bibhas Chakraborty; Nan Liu

doi:10.1016/j.artmed.2023.102587

Artif Intell Med. 2023 Aug:142:102587. doi: 10.1016/j.artmed.2023.102587. Epub 2023 May 22.

Affiliations

¹ Centre for Quantitative Medicine, Duke-NUS Medical School, Singapore.
² Programme in Health Services and Systems Research, Duke-NUS Medical School, Singapore; Department of Emergency Medicine, Singapore General Hospital, Singapore.
³ Centre for Quantitative Medicine, Duke-NUS Medical School, Singapore; Programme in Health Services and Systems Research, Duke-NUS Medical School, Singapore.
⁴ Department of Neurosurgery, Erasmus MC University Medical Center, Rotterdam, the Netherlands.
⁵ Centre for Quantitative Medicine, Duke-NUS Medical School, Singapore; Programme in Health Services and Systems Research, Duke-NUS Medical School, Singapore; Department of Statistics and Data Science, National University of Singapore, Singapore; Department of Biostatistics and Bioinformatics, Duke University, Durham, NC, USA.
⁶ Centre for Quantitative Medicine, Duke-NUS Medical School, Singapore; Programme in Health Services and Systems Research, Duke-NUS Medical School, Singapore; SingHealth AI Office, Singapore Health Services, Singapore; Institute of Data Science, National University of Singapore, Singapore. Electronic address: liu.nan@duke-nus.edu.sg.

PMID: 37316097

Abstract

Objective: The proper handling of missing values is critical to delivering reliable estimates and decisions, especially in high-stakes fields such as clinical research. In response to the increasing diversity and complexity of data, many researchers have developed deep learning (DL)-based imputation techniques. We conducted a systematic review to evaluate the use of these techniques, with a particular focus on the types of data, intending to assist healthcare researchers from various disciplines in dealing with missing data.

Materials and methods: We searched five databases (MEDLINE, Web of Science, Embase, CINAHL, and Scopus) for articles published prior to February 8, 2023 that described the use of DL-based models for imputation. We examined selected articles from four perspectives: data types, model backbones (i.e., main architectures), imputation strategies, and comparisons with non-DL-based methods. Based on data types, we created an evidence map to illustrate the adoption of DL models.

Results: Of 1822 articles, 111 were included. Tabular static data (29%, 32/111) and tabular temporal data (40%, 44/111) were the most frequently investigated. The choice of model backbone followed a discernible pattern across data types, for example, the dominance of autoencoders and recurrent neural networks for tabular temporal data. Imputation strategy usage also differed among data types: the "integrated" imputation strategy, which solves the imputation task simultaneously with downstream tasks, was most popular for tabular temporal data (52%, 23/44) and multi-modal data (56%, 5/9). Moreover, DL-based imputation methods yielded higher imputation accuracy than non-DL methods in most studies.
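
As an illustration of the "integrated" strategy, the sketch below combines a reconstruction loss computed only on observed entries with a downstream prediction loss, so that a single objective drives both imputation and the task. It is a minimal sketch under stated assumptions, not a method taken from any reviewed study: the function names, the binary outcome, and the weighting factor `alpha` are all hypothetical, and the reconstruction `x_hat` and predicted probabilities `y_prob` stand in for the outputs of whatever network is used.

```python
import numpy as np


def masked_mse(x, x_hat, mask):
    """Mean squared reconstruction error over observed entries only."""
    observed = mask.astype(bool)
    return float(np.mean((x[observed] - x_hat[observed]) ** 2))


def binary_cross_entropy(y, y_prob, eps=1e-12):
    """Downstream task loss for an assumed binary outcome."""
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return float(-np.mean(y * np.log(y_prob) + (1 - y) * np.log(1 - y_prob)))


def integrated_loss(x, x_hat, mask, y, y_prob, alpha=1.0):
    """Joint objective: imputation loss plus alpha times the task loss.

    alpha is a hypothetical weighting factor chosen for illustration.
    """
    return masked_mse(x, x_hat, mask) + alpha * binary_cross_entropy(y, y_prob)


def fill_missing(x, x_hat, mask):
    """Keep observed values and take model output only where data are missing."""
    return np.where(mask.astype(bool), x, x_hat)


# Toy data: one missing value, marked with NaN.
x = np.array([[1.0, np.nan], [3.0, 4.0]])
mask = ~np.isnan(x)                          # True where a value was observed
x_hat = np.array([[1.5, 2.0], [3.0, 3.0]])   # stand-in for a model's reconstruction
y = np.array([1.0, 0.0])                     # assumed binary outcome labels
y_prob = np.array([0.5, 0.5])                # stand-in for predicted probabilities

loss = integrated_loss(x, x_hat, mask, y, y_prob, alpha=1.0)
x_imputed = fill_missing(x, x_hat, mask)
```

In a "separated" strategy, by contrast, only the masked reconstruction term would be minimized first, and the downstream model would then be fitted on `x_imputed` in a second step.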

Conclusion: DL-based imputation models are a family of techniques with diverse network structures. Their design in healthcare is usually tailored to data types with different characteristics. Although DL-based imputation models may not be superior to conventional approaches across all datasets, they may well achieve satisfactory results for a particular data type or dataset. However, current DL-based imputation models still face issues of portability, interpretability, and fairness.

Keywords: Deep learning; Healthcare; Imputation; Missing value; Neural networks.

Publication types

Systematic Review
Review

MeSH terms

Databases, Factual
Deep Learning*
MEDLINE
Neural Networks, Computer