Latent Model-Based Clustering for Biological Discovery

Xin Bing; Florentina Bunea; Martin Royer; Jishnu Das

doi:10.1016/j.isci.2019.03.018

iScience. 2019 Apr 26:14:125-135. doi: 10.1016/j.isci.2019.03.018. Epub 2019 Mar 21.

Authors

Xin Bing¹, Florentina Bunea², Martin Royer³, Jishnu Das⁴

Affiliations

¹ Department of Statistical Science, Cornell University, Ithaca, NY 14853, USA.
² Department of Statistical Science, Cornell University, Ithaca, NY 14853, USA. Electronic address: fb238@cornell.edu.
³ Department of Statistical Science, Cornell University, Ithaca, NY 14853, USA; Department of Mathematics, Université Paris-Sud, 91405 Orsay, France.
⁴ Ragon Institute of MGH, Harvard, MIT, Cambridge, MA 02139, USA; Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA. Electronic address: jd327@cornell.edu.

Abstract

LOVE is a robust, scalable latent model-based clustering method for biological discovery that can be applied across a range of datasets to generate both overlapping and non-overlapping clusters. In our formulation, a cluster comprises the variables associated with the same latent factor and is determined from an allocation matrix that indexes our latent model. We prove that the allocation matrix and the corresponding clusters are uniquely defined. We apply LOVE to three biological datasets (gene expression, serological responses measured from HIV controllers and chronic progressors, and vaccine-induced humoral immune responses) and obtain biologically meaningful results. For all three datasets, the clusters generated by LOVE remain stable across tuning parameters. Finally, we compare LOVE's performance with that of 13 state-of-the-art methods using previously established benchmarks and find that LOVE outperforms these methods across datasets. Our results demonstrate that LOVE can be broadly used across large-scale biological datasets to generate accurate and meaningful overlapping and non-overlapping clusters.
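
As a rough illustration of how clusters can be read off an allocation matrix, the sketch below assumes the matrix has one row per variable and one column per latent factor, with a nonzero entry marking that the variable is associated with that factor. This layout, the function name `clusters_from_allocation`, the tolerance `tol`, and the example matrix are assumptions made for illustration only; the code does not estimate the allocation matrix and is not the LOVE procedure itself.

```python
def clusters_from_allocation(A, tol=1e-8):
    """Group variable indices by latent factor.

    A is a list of rows (one per variable); each row holds one entry per
    latent factor. A variable joins cluster k when |A[j][k]| exceeds tol.
    This row/column layout is an assumption of this sketch.
    """
    n_factors = len(A[0]) if A else 0
    return [
        [j for j, row in enumerate(A) if abs(row[k]) > tol]
        for k in range(n_factors)
    ]


# Hypothetical allocation matrix: 5 variables (rows), 2 latent factors (columns).
A = [
    [1.0, 0.0],   # variable 0: factor 0 only
    [-1.0, 0.0],  # variable 1: factor 0 only (sign does not affect membership)
    [0.0, 1.0],   # variable 2: factor 1 only
    [0.5, 0.5],   # variable 3: both factors, so it sits in both clusters
    [0.0, 0.0],   # variable 4: no factor, so it is left unclustered
]

clusters = clusters_from_allocation(A)
print(clusters)  # [[0, 1, 3], [2, 3]]
```

In this toy example variable 3 appears in both clusters, which is how an overlapping clustering arises under this reading; if every row had a single nonzero entry, the clusters would be non-overlapping.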

Keywords: Bioinformatics; Biological Sciences; Statistical Computing.