Hypercluster: a flexible tool for parallelized unsupervised clustering optimization

Lili Blumenberg; Kelly V Ruggles

doi:10.1186/s12859-020-03774-1

BMC Bioinformatics. 2020 Sep 29;21(1):428. doi: 10.1186/s12859-020-03774-1.

Authors

Lili Blumenberg¹,², Kelly V Ruggles³,⁴

Affiliations

¹ Institute of Systems Genetics, New York University Grossman School of Medicine, New York, NY, 10016, USA.
² Department of Medicine, New York University Grossman School of Medicine, New York, NY, 10016, USA.
³ Institute of Systems Genetics, New York University Grossman School of Medicine, New York, NY, 10016, USA. kelly.ruggles@nyulangone.org.
⁴ Department of Medicine, New York University Grossman School of Medicine, New York, NY, 10016, USA. kelly.ruggles@nyulangone.org.

Abstract

Background: Unsupervised clustering is a common and exceptionally useful tool for large biological datasets. However, clustering requires upfront algorithm and hyperparameter selection, which can introduce bias into the final clustering labels. It is therefore advisable to obtain a range of clustering results from multiple models and hyperparameters, which can be cumbersome and slow.

Results: We present hypercluster, a Python package and SnakeMake pipeline for flexible and parallelized clustering evaluation and selection. Users can efficiently evaluate a wide range of clustering results from multiple models and hyperparameters to identify an optimal model.
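
As a rough illustration of the kind of evaluation described here, the sketch below compares several clustering models and hyperparameter settings on one dataset and ranks them by a single metric. It does not use hypercluster's own API; it relies only on scikit-learn, and the synthetic data, the `search_space` dictionary, and the choice of silhouette score as the selection criterion are all assumptions made for illustration. It also runs serially, whereas the package described above parallelizes this step.

```python
from itertools import product

from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data standing in for a samples-by-features matrix (assumption).
X, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

# Hypothetical search space: each model class maps to a hyperparameter grid.
search_space = {
    KMeans: {"n_clusters": [2, 3, 4, 5], "n_init": [10], "random_state": [0]},
    AgglomerativeClustering: {
        "n_clusters": [2, 3, 4, 5],
        "linkage": ["ward", "average"],
    },
}

results = []
for model_cls, grid in search_space.items():
    names = list(grid)
    # Enumerate every combination of hyperparameter values for this model.
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        labels = model_cls(**params).fit_predict(X)
        # Silhouette score is one possible criterion; others could be swapped in.
        score = silhouette_score(X, labels)
        results.append((score, model_cls.__name__, params))

# Pick the configuration with the highest score.
best_score, best_model, best_params = max(results, key=lambda r: r[0])
print(best_model, best_params, round(best_score, 3))
```

Keeping every `(score, model, params)` tuple in `results`, rather than only the winner, makes it possible to inspect how sensitive the labels are to the choice of algorithm and hyperparameters.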

Conclusions: Hypercluster improves the ease of use, robustness and reproducibility of unsupervised clustering applications in high-throughput biology. Hypercluster is available on pip and bioconda; installation instructions, documentation and example workflows can be found at: https://github.com/ruggleslab/hypercluster.

Keywords: Hyperparameter optimization; Machine learning; Python; Scikit-learn; SnakeMake; Unsupervised clustering.

MeSH terms

Algorithms
Cluster Analysis
Computational Biology
User-Computer Interface*
