Sparse clusterability: testing for cluster structure in high dimensions

Jose Laborde; Paul A Stewart; Zhihua Chen; Yian A Chen; Naomi C Brownstein

doi:10.1186/s12859-023-05210-6

BMC Bioinformatics. 2023 Mar 31;24(1):125. doi: 10.1186/s12859-023-05210-6.

Authors

Jose Laborde¹, Paul A Stewart²,³, Zhihua Chen², Yian A Chen²,³, Naomi C Brownstein⁴,⁵,⁶

Affiliations

¹ Department of Biostatistics and Bioinformatics, Moffitt Cancer Center, Tampa, FL, USA. jose.laborde@moffitt.org.
² Department of Biostatistics and Bioinformatics, Moffitt Cancer Center, Tampa, FL, USA.
³ Department of Oncologic Sciences, University of South Florida, Tampa, FL, USA.
⁴ Department of Biostatistics and Bioinformatics, Moffitt Cancer Center, Tampa, FL, USA. brownstn@musc.edu.
⁵ Department of Oncologic Sciences, University of South Florida, Tampa, FL, USA. brownstn@musc.edu.
⁶ Department of Public Health Sciences, Medical University of South Carolina, Charleston, SC, USA. brownstn@musc.edu.

Abstract

Background: Cluster analysis is utilized frequently in scientific theory and applications to separate data into groups. A key assumption in many clustering algorithms is that the data was generated from a population consisting of multiple distinct clusters. Clusterability testing allows users to question the inherent assumption of latent cluster structure, a theoretical requirement for meaningful results in cluster analysis.

Results: This paper proposes methods for clusterability testing designed for high-dimensional data by utilizing sparse principal component analysis. Type I error and power of the clusterability tests are evaluated using simulated data with different types of cluster structure in high dimensions. Empirical performance of the new methods is evaluated and compared with prior methods on gene expression, microarray, and shotgun proteomics data. Our methods had reasonably low Type I error and maintained power for many datasets with a variety of structures and dimensions. Cluster structure was not detectable in other datasets with spatially close clusters.

Conclusion: This is the first analysis of clusterability testing on both simulated and real-world high-dimensional data.
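The Results above describe reducing high-dimensional data with sparse principal component analysis before testing for cluster structure. As an illustration of the reduction step only, the following is a minimal sketch that assumes the data are projected onto a single sparse component. It uses a truncated power iteration as a stand-in for sparse PCA, which is not necessarily the algorithm used in the paper, and the function name `sparse_first_pc`, its parameters, and the simulated data are all hypothetical. The multimodality test that would then be applied to the scores is not shown.

```python
import numpy as np

def sparse_first_pc(X, n_nonzero, n_iter=100):
    """Hypothetical helper: first sparse principal component via truncated
    power iteration (keep only the n_nonzero largest loadings each step).
    Returns (loadings, scores)."""
    Xc = X - X.mean(axis=0)                      # center each feature
    # Initialize from the ordinary (dense) leading principal direction.
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    v = vt[0]
    for _ in range(n_iter):
        w = Xc.T @ (Xc @ v)                      # power step on X^T X
        keep = np.argsort(np.abs(w))[-n_nonzero:]
        w_sparse = np.zeros_like(w)
        w_sparse[keep] = w[keep]                 # zero all but the largest loadings
        norm = np.linalg.norm(w_sparse)
        if norm == 0:                            # degenerate data: stop early
            break
        v = w_sparse / norm                      # renormalize to unit length
    return v, Xc @ v

# Simulated example (assumed setup): two clusters that differ only in the
# first 5 of 200 features.
rng = np.random.default_rng(0)
n_per, p, k = 60, 200, 5
X = rng.standard_normal((2 * n_per, p))
X[:n_per, :k] += 3.0
X[n_per:, :k] -= 3.0

loadings, scores = sparse_first_pc(X, n_nonzero=k)
# `scores` is a one-dimensional summary of the data; a multimodality test
# would be applied to it to assess clusterability.
```

In this sketch the sparsity level `n_nonzero` is fixed by hand; in practice it would be a tuning choice, and the one-dimensional `scores` would be passed to a multimodality test of the user's choosing.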

Keywords: Big data; Cluster analysis; Cluster tendency; Clustering; Dimension reduction; Distance metrics; Multimodality testing; Principal component analysis; Sparsity.

MeSH terms

Algorithms*
Cluster Analysis
