A Dirichlet model of alignment cost in mixed-membership unsupervised clustering

Xiran Liu; Naama M Kopelman; Noah A Rosenberg

doi:10.1080/10618600.2022.2127739

J Comput Graph Stat. 2023;32(3):1145-1159. doi: 10.1080/10618600.2022.2127739. Epub 2022 Nov 14.

Authors

Xiran Liu¹, Naama M Kopelman², Noah A Rosenberg³

Affiliations

¹ Institute for Computational and Mathematical Engineering, Stanford University.
² Faculty of Sciences, Holon Institute of Technology.
³ Department of Biology, Stanford University.

Abstract

Mixed-membership unsupervised clustering is widely used to extract informative patterns from data in many application areas. For a shared data set, the stochasticity and unsupervised nature of clustering algorithms can cause difficulties in comparing clustering results produced by different algorithms, or even multiple runs of the same algorithm, as outcomes can differ owing to permutation of the cluster labels or genuine differences in clustering results. Here, with a focus on inference of individual genetic ancestry in population-genetic studies, we study the cost of misalignment of mixed-membership unsupervised clustering replicates under a theoretical model of cluster memberships. Using Dirichlet distributions to model membership coefficient vectors, we provide theoretical results quantifying the alignment cost as a function of the Dirichlet parameters and the Hamming permutation difference between replicates. For fixed Dirichlet parameters, the alignment cost is seen to increase with the Hamming distance between permutations. Data sets with low variance across individuals of membership coefficients for specific clusters generally produce high misalignment costs, so that a single optimal permutation has far lower cost than suboptimal permutations. Higher variability in data, as represented by greater variance of membership coefficients, generally results in alignment costs that are similar between the optimal permutation and suboptimal permutations. We demonstrate the application of the theoretical results to data simulated under the Dirichlet model, as well as to membership estimates from inference of human-genetic ancestry. The results can contribute to improving cluster alignment algorithms that seek to find optimal permutations of replicates.
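
The qualitative behavior described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' code. It assumes that two replicates are drawn independently from the same Dirichlet distribution and that the alignment cost is the mean squared Euclidean distance between corresponding membership vectors; the paper's exact cost definition may differ. The function names `hamming_distance` and `mean_alignment_cost` and the parameter values are hypothetical.

```python
import numpy as np


def hamming_distance(perm):
    """Count positions at which perm differs from the identity permutation."""
    perm = np.asarray(perm)
    return int(np.sum(perm != np.arange(perm.size)))


def mean_alignment_cost(alpha, perm, n_individuals=2000, seed=0):
    """Estimate the cost of pairing two replicates under a column permutation.

    Assumption (not taken from the paper): rows of both replicates are
    independent Dirichlet(alpha) draws, and the cost is the squared Euclidean
    distance between membership vectors, averaged over individuals.
    """
    rng = np.random.default_rng(seed)
    q1 = rng.dirichlet(alpha, size=n_individuals)  # replicate 1
    q2 = rng.dirichlet(alpha, size=n_individuals)  # replicate 2
    q2_permuted = q2[:, list(perm)]                # relabel clusters in replicate 2
    return float(np.mean(np.sum((q1 - q2_permuted) ** 2, axis=1)))


# Hypothetical Dirichlet parameters for K = 4 clusters.
base_alpha = np.array([8.0, 4.0, 2.0, 1.0])

# Permutations at Hamming distance 0, 2, 3, and 4 from the identity.
perms = [(0, 1, 2, 3), (1, 0, 2, 3), (1, 2, 0, 3), (1, 2, 3, 0)]

# Scaling alpha keeps the mean memberships fixed but changes their variance:
# a larger scale gives lower variance across individuals.
for scale in (10.0, 1.0, 0.1):
    costs = [mean_alignment_cost(scale * base_alpha, p) for p in perms]
    summary = ", ".join(
        f"d={hamming_distance(p)}: {c:.4f}" for p, c in zip(perms, costs)
    )
    print(f"alpha scale {scale}: {summary}")
```

Because the same seed is reused, every permutation is evaluated on the same pair of simulated replicates, so differences in cost reflect the permutation alone. Under these assumptions the identity permutation should give the lowest cost, and the gap between it and the other permutations should be much wider when `scale` is large (low variance) than when it is small (high variance).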

Keywords: Dirichlet model; admixture; label-switching; multimodality.

Grants and funding

R01 HG005855/HG/NHGRI NIH HHS/United States