Multi-assignment clustering: Machine learning from a biological perspective

J Biotechnol. 2021 Jan 20;326:1-10. doi: 10.1016/j.jbiotec.2020.12.002. Epub 2020 Dec 4.

Authors

Benjamin Ulfenborg¹, Alexander Karlsson², Maria Riveiro³, Christian X Andersson⁴, Peter Sartipy⁵, Jane Synnergren⁵

Affiliations

¹ School of Bioscience, University of Skövde, Skövde, Sweden. Electronic address: benjamin.ulfenborg@his.se.
² School of Informatics, University of Skövde, Skövde, Sweden.
³ School of Informatics, University of Skövde, Skövde, Sweden; Department of Computer Science and Informatics, School of Engineering, Jönköping University, Jönköping, Sweden.
⁴ Takara Bio Europe AB, Gothenburg, Sweden.
⁵ School of Bioscience, University of Skövde, Skövde, Sweden.

PMID: 33285150
DOI: 10.1016/j.jbiotec.2020.12.002

Abstract

A common approach for analyzing large-scale molecular data is to cluster objects sharing similar characteristics. This assumes that genes with highly similar expression profiles are likely to participate in a common molecular process. Biological systems are extremely complex and challenging to understand, with proteins having multiple functions that sometimes need to be activated or expressed in a time-dependent manner. Thus, the strategies applied for clustering these molecules into groups are of key importance for translating data into biologically interpretable findings. Here we implemented a multi-assignment clustering (MAsC) approach that allows molecules to be assigned to multiple clusters, rather than to a single one as in commonly used clustering techniques. When applied to high-throughput transcriptomics data, MAsC increased the power of the downstream pathway analysis and allowed identification of pathways with high biological relevance to the experimental setting and the biological systems studied. Multi-assignment clustering also reduced noise in the clustering partition by excluding genes with a low correlation to all of the resulting clusters. Together, these findings suggest that our methodology facilitates translation of large-scale molecular data into biological knowledge. The method is made available as an R package on GitLab (https://gitlab.com/wolftower/masc).
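
The general idea of multiple cluster assignment with exclusion of poorly correlated genes can be illustrated with a minimal Python sketch. This is not the MAsC algorithm or the API of the R package, neither of which is specified in this abstract. It assumes that cluster centroids are already available (for example from k-means) and that a gene joins every cluster whose centroid it correlates with above a cutoff. The function names `pearson` and `multi_assign`, the cutoff of 0.7, and the toy profiles are all hypothetical choices made for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a flat profile is treated as uncorrelated
    return cov / (sd_x * sd_y)


def multi_assign(profiles, centroids, threshold=0.7):
    """Map each gene to the indices of all clusters it correlates with.

    A gene is assigned to every cluster whose centroid has a Pearson
    correlation >= threshold with the gene's profile. A gene that fails
    the cutoff for all clusters gets an empty list, i.e. it is excluded.
    The threshold value is an illustrative assumption.
    """
    return {
        gene: [
            k for k, centroid in enumerate(centroids)
            if pearson(profile, centroid) >= threshold
        ]
        for gene, profile in profiles.items()
    }


# Hypothetical toy data: two centroids and three expression profiles
# measured at four time points.
centroids = [
    [1.0, 2.0, 3.0, 4.0],  # cluster 0: steady increase
    [1.0, 3.0, 2.0, 4.0],  # cluster 1: increase with a dip
]
profiles = {
    "geneA": [2.0, 4.0, 6.0, 8.0],  # rises steadily
    "geneB": [4.0, 3.0, 2.0, 1.0],  # falls steadily
    "geneD": [1.0, 4.0, 1.0, 4.0],  # oscillates
}

result = multi_assign(profiles, centroids, threshold=0.7)
for gene, clusters in result.items():
    print(gene, clusters if clusters else "excluded")
```

In this toy example geneA is assigned to both clusters, geneD only to cluster 1, and geneB, which is anticorrelated with both centroids, is excluded from the partition.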

Keywords: Annotation enrichment; Clustering; K-means; Multiple cluster assignment; Pathways; Transcriptomics.

MeSH terms

Algorithms*
Cluster Analysis
Gene Expression Profiling
Machine Learning*