Efficient similarity-based data clustering by optimal object to cluster reallocation

PLoS One. 2018 Jun 1;13(6):e0197450. doi: 10.1371/journal.pone.0197450. eCollection 2018.

Abstract

We present an iterative flat hard clustering algorithm designed to operate on arbitrary similarity matrices, with the only constraint that these matrices be symmetrical. Although functionally very close to kernel k-means, our proposal performs a maximization of average intra-class similarity, instead of a squared distance minimization, in order to remain closer to the semantics of similarities. We show that this approach permits the relaxing of some conditions on usable affinity matrices like semi-positiveness, as well as opening possibilities for computational optimization required for large datasets. Systematic evaluation on a variety of data sets shows that compared with kernel k-means and the spectral clustering methods, the proposed approach gives equivalent or better performance, while running much faster. Most notably, it significantly reduces memory access, which makes it a good choice for large data collections. Material enabling the reproducibility of the results is made available online.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Cluster Analysis
  • Computer Simulation*
  • Models, Theoretical*
  • Semantics*

Grants and funding

This work was supported by Agence Nationale de la Recherche (FR grants ANR-11-JS03-005-01 and ANR-16-CE22-0012 to ML.