The next-generation K-means algorithm

Eugene Demidenko

doi:10.1002/sam.11379

Stat Anal Data Min. 2018 Aug;11(4):153-166. doi: 10.1002/sam.11379. Epub 2018 May 11.

Author

Eugene Demidenko¹

Affiliation

¹ Department of Biomedical Data Science and Department of Mathematics, Dartmouth College, Hanover, New Hampshire.

Abstract

Typically, when referring to a model-based classification, the mixture distribution approach is understood. In contrast, we revive the hard-classification model-based approach developed by Banfield and Raftery (1993), for which K-means is equivalent to maximum likelihood (ML) estimation. The next-generation K-means algorithm does not end after the classification is achieved, but moves forward to answer the following fundamental questions: Are there clusters, how many clusters are there, what are the statistical properties of the estimated means and index sets, what is the distribution of the coefficients in the clusterwise regression, and how should multilevel data be classified? The statistical model-based approach to the K-means algorithm is the key, because it allows statistical simulations and the study of classification properties along the lines of classical statistics. This paper illustrates the application of ML classification to testing the no-clusters hypothesis, to studying various methods for selecting the number of clusters using simulations, to robust clustering using the Laplace distribution, to studying the properties of the coefficients in clusterwise regression, and finally to multilevel data by marrying the variance components model with K-means.

Keywords: K-medians; clusterwise regression; hard classification; maximum likelihood; multilevel data; robust clustering; SigClust.
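
The equivalence between K-means and ML estimation mentioned in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes the simplest hard-classification model, in which every cluster is spherical Gaussian with a common variance sigma^2, so that maximizing the classification likelihood over the means and labels is the same as minimizing the within-cluster sum of squares (WSS). The function names `kmeans` and `profile_loglik`, the simulated data, and the seeds are all hypothetical choices made for illustration.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd iterations; returns (centers, labels, WSS)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # k distinct observations as starting means
    labels = None
    for _ in range(n_iter):
        # Hard classification: each point goes to its nearest center.
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                      for p in points]
        if new_labels == labels:
            break  # assignments are stable, so the means are stable too
        labels = new_labels
        # Given the labels, the ML estimate of each mean is the cluster average.
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    wss = sum(sq_dist(p, centers[l]) for p, l in zip(points, labels))
    return centers, labels, wss


def profile_loglik(wss, n, m):
    """Classification log-likelihood with sigma^2 profiled out.

    Assumed model: n points in m dimensions, common spherical variance.
    The ML estimate is sigma^2 = WSS / (n * m), and the maximized
    log-likelihood is a decreasing function of WSS.
    """
    sigma2 = wss / (n * m)
    return -0.5 * n * m * (math.log(2 * math.pi * sigma2) + 1)


# Simulated two-cluster data (illustrative only, not from the paper).
rng = random.Random(1)
data = ([(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(50)]
        + [(rng.gauss(10, 1), rng.gauss(10, 1)) for _ in range(50)])
n, m = len(data), 2

_, labels1, wss1 = kmeans(data, k=1)
_, labels2, wss2 = kmeans(data, k=2)
print("K=1 log-likelihood:", profile_loglik(wss1, n, m))
print("K=2 log-likelihood:", profile_loglik(wss2, n, m))
```

Because the profiled log-likelihood depends on the data only through WSS, any labeling that lowers WSS raises the likelihood under this assumed model. Replacing the Gaussian with a Laplace density would, in the same spirit, replace squared distances with absolute deviations and cluster means with medians, which is the K-medians variant listed in the keywords.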
