SpCLUST: Towards a fast and reliable clustering for potentially divergent biological sequences

Johny Matar; Hicham El Khoury; Jean-Claude Charr; Christophe Guyeux; Stéphane Chrétien

doi:10.1016/j.compbiomed.2019.103439

Comput Biol Med. 2019 Nov:114:103439. doi: 10.1016/j.compbiomed.2019.103439. Epub 2019 Sep 10.

Authors

Johny Matar¹, Hicham El Khoury², Jean-Claude Charr³, Christophe Guyeux⁴, Stéphane Chrétien⁵

Affiliations

¹ Université de Bourgogne Franche-Comté, UMR 6174 CNRS, 16 route de Gray, Besançon, France; LaRRIS, Faculty of Science, Lebanese University, Fanar, Lebanon.
² LaRRIS, Faculty of Science, Lebanese University, Fanar, Lebanon.
³ Université de Bourgogne Franche-Comté, UMR 6174 CNRS, 16 route de Gray, Besançon, France. Electronic address: jean-claude.charr@univ-fcomte.fr.
⁴ Université de Bourgogne Franche-Comté, UMR 6174 CNRS, 16 route de Gray, Besançon, France.
⁵ National Physical Laboratory, Hampton Road, Teddington, United Kingdom.

PMID: 31550555
DOI: 10.1016/j.compbiomed.2019.103439

Abstract

This paper presents SpCLUST, a new C++ package that takes a list of sequences as input, aligns them with MUSCLE, computes their similarity matrix in parallel, and then performs the clustering. SpCLUST extends previously released software by integrating additional scoring matrices, which enables it to cluster amino-acid sequences as well. The similarity matrix is now computed in parallel using MPI, following a master/slave distributed architecture. Performance analyses, carried out on two real datasets of 100 nucleotide sequences and 1049 amino-acid sequences, show that the resulting library substantially outperforms the original Python package. The proposed package was also extensively evaluated on simulated and real genomic and protein datasets. The clustering results were compared with those of the best-known traditional tools, such as UCLUST, CD-HIT and DNACLUST. The comparison showed that SpCLUST outperforms the other tools when clustering divergent sequences and that, unlike them, it does not require any user intervention or prior knowledge about the input sequences.
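As an illustration of the similarity-matrix step described above, the following is a minimal Python sketch and not SpCLUST's actual implementation. It rests on several assumptions: the sequences are already aligned (for example by MUSCLE) and therefore have equal length, a plain fraction-of-identical-columns score stands in for the scoring matrices the package integrates, and a thread pool stands in for the MPI master/slave distribution. The names `pair_identity` and `similarity_matrix` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def pair_identity(a, b):
    """Fraction of aligned columns in which both sequences carry the same symbol.

    Assumption: a simple identity score, not one of SpCLUST's scoring matrices.
    Columns where both sequences have a gap ('-') are ignored.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue  # gap against gap carries no information
        compared += 1
        if x == y:
            matches += 1
    return matches / compared if compared else 0.0


def similarity_matrix(aligned, workers=4):
    """Build a symmetric similarity matrix, scoring sequence pairs concurrently.

    Assumption: threads play the role of the worker processes; the calling
    thread plays the role of the master that hands out pairs and collects scores.
    """
    n = len(aligned)
    sim = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    pairs = list(combinations(range(n), 2))  # each unordered pair once

    def score(pair):
        i, j = pair
        return i, j, pair_identity(aligned[i], aligned[j])

    with ThreadPoolExecutor(max_workers=workers) as pool:
        for i, j, s in pool.map(score, pairs):
            sim[i][j] = s
            sim[j][i] = s  # the score is symmetric
    return sim


# Toy aligned input (hypothetical data, equal lengths, '-' marks a gap).
aligned = ["ACGT-A", "ACGTTA", "TC-TTA"]
sim = similarity_matrix(aligned)
```

Each pair is scored independently of every other pair, which is what makes this step straightforward to distribute. The resulting matrix is the input a clustering stage would then consume.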

Keywords: Gaussian mixture model; Genomics; Laplacian eigenmaps; Parallel computation; Sequences clustering; Spectral clustering.

MeSH terms

Algorithms
Cluster Analysis*
DNA* / classification
DNA* / genetics
Genomics / methods*
Humans
Sequence Analysis, DNA / methods*
Software*

Substances

DNA