Incomplete clustering analysis via multiple imputation

Jung Wun Lee; Ofer Harel

doi:10.1080/02664763.2022.2060952

J Appl Stat. 2022 Apr 12;50(9):1962-1979. doi: 10.1080/02664763.2022.2060952. eCollection 2023.

Authors

Jung Wun Lee¹, Ofer Harel¹

Affiliation

¹ Department of Statistics, University of Connecticut, Storrs, CT, USA.

Abstract

Clustering analysis is a prevalent statistical method that divides a population into several subgroups of similar units. However, most existing clustering methods require complete data. One general method that addresses incomplete data is multiple imputation (MI), which avoids many limitations of single-imputation methods and complete-case analyses. Nevertheless, adapting the MI framework to clustering analysis can be challenging, since each imputed data set might yield a different number of clusters and there is no unique parameter for clustering analysis. In response to this problem, we have developed MICA: Multiply Imputed Cluster Analysis. MICA is a framework for clustering incomplete data that consists of two clustering stages. We assess the properties of MICA and its superiority over other existing incomplete clustering strategies based on a simulation study under various data structures. In addition, we demonstrate the usage of MICA by applying it to the Youth Risk Behavior Surveillance System (YRBSS) 2019 data.

Keywords: 62H30; Incomplete data; cluster analysis; missing data; model-based clustering; multiple imputation.

Grants and funding

This project was partially supported by Award Number DMS-2015320 from the National Science Foundation.