Finding the mean in a partition distribution

Thomas J Glassen; Timo von Oertzen; Dmitry A Konovalov

doi:10.1186/s12859-018-2359-z

BMC Bioinformatics. 2018 Oct 12;19(1):375. doi: 10.1186/s12859-018-2359-z.

Authors

Thomas J Glassen¹, Timo von Oertzen²,³, Dmitry A Konovalov⁴

Affiliations

¹ Department of Psychology, Universität der Bundeswehr München, Werner-Heisenberg-Weg 39, Neubiberg, 85577, Germany.
² Department of Psychology, Universität der Bundeswehr München, Werner-Heisenberg-Weg 39, Neubiberg, 85577, Germany. timo.vonoertzen@unibw.de.
³ Max Planck Institute for Human Development, Department for Lifespan Psychology, Berlin, Lentzeallee 94, Berlin, 14195, Germany. timo.vonoertzen@unibw.de.
⁴ School of Information Technology, James Cook University, 1 James Cook Drive, Townsville, QLD 4811, Australia.

Abstract

Background: Bayesian clustering algorithms, in particular those utilizing Dirichlet Processes (DP), return a sample from the posterior distribution over partitions of a set. In many applied cases, however, a single clustering solution is desired, so a 'best' partition must be derived from the posterior sample. Which solution should be recommended in which situation remains an open research question. One candidate is the sample mean, defined as the clustering with minimal squared distance to all partitions in the posterior sample, weighted by their probability. In this article, we review an algorithm that approximates this sample mean by using the Hungarian Method to compute the distance between partitions. This algorithm leaves room for further acceleration.
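To make the definition concrete, the following is a minimal Python sketch, not the authors' Java implementation. It assumes the partition distance is the minimum number of elements that must be removed for two partitions to coincide, which equals n minus the value of a maximum-weight matching on the cluster overlap matrix; that matching is solved here with a textbook Hungarian Method. The mean is then found by exhaustive search over all partitions, which is feasible only for very small sets and stands in for the approximation algorithm reviewed in the article. All function names (`min_cost_assignment`, `partition_distance`, `all_partitions`, `mean_partition`) and the label-list representation of a partition are hypothetical choices made for this illustration.

```python
def min_cost_assignment(cost):
    """Textbook O(k^3) Hungarian Method on a square cost matrix.

    Returns the minimal total cost of a perfect assignment.
    """
    k = len(cost)
    INF = float("inf")
    u = [0] * (k + 1)    # row potentials (1-based)
    v = [0] * (k + 1)    # column potentials (1-based)
    p = [0] * (k + 1)    # p[j] = row matched to column j, 0 = unmatched
    way = [0] * (k + 1)  # back-pointers along the augmenting path
    for i in range(1, k + 1):
        p[0] = i
        j0 = 0
        minv = [INF] * (k + 1)
        used = [False] * (k + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], INF, 0
            for j in range(1, k + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(k + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0:  # flip the matching along the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    return sum(cost[p[j] - 1][j - 1] for j in range(1, k + 1))


def partition_distance(a, b):
    """Distance between two partitions given as cluster-label lists."""
    rows = {lab: i for i, lab in enumerate(dict.fromkeys(a))}
    cols = {lab: j for j, lab in enumerate(dict.fromkeys(b))}
    k = max(len(rows), len(cols))
    # Negated overlap counts, zero-padded to a square matrix.
    cost = [[0] * k for _ in range(k)]
    for x, y in zip(a, b):
        cost[rows[x]][cols[y]] -= 1
    return len(a) + min_cost_assignment(cost)


def all_partitions(n):
    """Yield every partition of n elements as a restricted growth string."""
    def rec(prefix, k):
        if len(prefix) == n:
            yield list(prefix)
            return
        for label in range(k + 1):
            prefix.append(label)
            yield from rec(prefix, max(k, label + 1))
            prefix.pop()
    yield from rec([], 0)


def mean_partition(sample, weights=None):
    """Brute-force mean: minimise the weighted sum of squared distances."""
    weights = weights or [1] * len(sample)

    def total_cost(candidate):
        return sum(w * partition_distance(candidate, s) ** 2
                   for s, w in zip(sample, weights))

    return min(all_partitions(len(sample[0])), key=total_cost)
```

For example, `mean_partition([[0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 0, 1]])` searches all 15 partitions of four elements. Because the number of candidates grows as the Bell numbers, the exhaustive search is useful only as a reference for checking faster methods on toy inputs.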

Results: We highlight a faster variant of the partition distance reduction that leads to a runtime complexity up to two orders of magnitude lower than that of the standard variant. We suggest two further improvements. The first is deterministic and based on an adapted dynamic version of the Hungarian Algorithm, which achieves another runtime decrease of at least one order of magnitude. The second is theoretical and uses Monte Carlo techniques and the dynamic matrix inverse, further reducing the runtime complexity by nearly the square root of one order of magnitude.

Conclusions: Overall, this results in a new mean partition algorithm with an acceleration factor reaching beyond that of the present algorithm by the size of the partitions. The new algorithm is implemented in Java and available on GitHub (Glassen, Mean Partition, 2018).

Keywords: Bayesian clustering; Dirichlet Process; Mean partition; Partition distance.

MeSH terms

Algorithms
Bayes Theorem*
Humans