A clustering package for nucleotide sequences using Laplacian Eigenmaps and Gaussian Mixture Model

Marine Bruneau; Thierry Mottet; Serge Moulin; Maël Kerbiriou; Franz Chouly; Stéphane Chretien; Christophe Guyeux

doi:10.1016/j.compbiomed.2017.12.003

Comput Biol Med. 2018 Feb 1:93:66-74. doi: 10.1016/j.compbiomed.2017.12.003. Epub 2017 Dec 15.

Authors

Marine Bruneau¹, Thierry Mottet², Serge Moulin³, Maël Kerbiriou¹, Franz Chouly¹, Stéphane Chretien⁴, Christophe Guyeux²

Affiliations

¹ Laboratoire de Mathématiques de Besançon, UMR 6623 CNRS, France; Université de Bourgogne Franche-Comté, 16 route de Gray, 25030 Besançon, France.
² Computer Science Department, FEMTO-ST Institute, UMR 6174 CNRS, France; Université de Bourgogne Franche-Comté, 16 route de Gray, 25030 Besançon, France.
³ Computer Science Department, FEMTO-ST Institute, UMR 6174 CNRS, France; Université de Bourgogne Franche-Comté, 16 route de Gray, 25030 Besançon, France. Electronic address: serge.moulin@univ-fcomte.fr.
⁴ National Physical Laboratory, Hampton Road, Teddington, United Kingdom.

PMID: 29288886

Abstract

In this article, we propose a new Python package for clustering nucleotide sequences. The package, freely available online, implements a Laplacian eigenmap embedding and a Gaussian mixture model for DNA clustering. It takes nucleotide sequences as input and produces the optimal number of clusters along with a relevant visualization. Although we did not optimise computational speed, the method still performs reasonably well in practice. Our focus was mainly on data analytics and accuracy; as a result, our approach outperforms the state of the art, even in the case of divergent sequences. Furthermore, a priori knowledge of the number of clusters is not required. For the sake of illustration, the method is applied to a set of 100 DNA sequences taken from the mitochondrially encoded NADH dehydrogenase 3 (ND3) gene, extracted from a collection of Platyhelminthes and Nematoda species. The resulting clusters are closely consistent with the phylogenetic tree computed using a maximum likelihood approach on the gene alignment. They are also consistent with the NCBI taxonomy. Further test results on synthesized data are then provided, showing that the proposed approach recovers the clusters better than the most widely used software, namely Cd-hit-est and BLASTClust.
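The pipeline named in the abstract (Laplacian eigenmap embedding, then a Gaussian mixture model that also picks the number of clusters) can be illustrated with a minimal sketch. It is not the package's code or API, and it rests on several assumptions not stated in the abstract: sequence similarity is approximated here by a Gaussian kernel on 3-mer frequency profiles, the number of clusters is chosen by minimising the BIC, and the function names `kmer_profile`, `laplacian_eigenmap`, and `cluster_sequences` are hypothetical.

```python
from itertools import product

import numpy as np
from sklearn.mixture import GaussianMixture


def kmer_profile(seq, k=3):
    """Normalised k-mer frequency vector over the alphabet ACGT.

    Assumption: a k-mer profile stands in for whatever sequence
    similarity the actual package computes.
    """
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    counts = np.zeros(len(index))
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:  # skip k-mers containing ambiguous bases
            counts[j] += 1
    total = counts.sum()
    return counts / total if total > 0 else counts


def laplacian_eigenmap(affinity, n_components=2):
    """Embed points with the low-order eigenvectors of the graph Laplacian."""
    degree = affinity.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    # Symmetric normalised Laplacian: I - D^{-1/2} W D^{-1/2}
    laplacian = np.eye(len(affinity)) - (
        d_inv_sqrt[:, None] * affinity * d_inv_sqrt[None, :]
    )
    _, eigvecs = np.linalg.eigh(laplacian)  # eigenvalues in ascending order
    # Drop the first (trivial) eigenvector; rescaling by D^{-1/2} gives the
    # solutions of the generalised problem L y = lambda D y.
    return eigvecs[:, 1:n_components + 1] * d_inv_sqrt[:, None]


def cluster_sequences(seqs, max_clusters=5, k=3, n_components=2, seed=0):
    """Return (labels, number of clusters) for a list of DNA strings."""
    profiles = np.array([kmer_profile(s, k) for s in seqs])
    dists = np.linalg.norm(profiles[:, None, :] - profiles[None, :, :], axis=2)
    sigma = np.median(dists[dists > 0])  # kernel width heuristic (assumption)
    affinity = np.exp(-dists ** 2 / (2 * sigma ** 2))
    embedding = laplacian_eigenmap(affinity, n_components)
    # Fit one mixture per candidate cluster count and keep the lowest BIC
    # (assumption: the abstract does not name the selection criterion).
    models = [
        GaussianMixture(n_components=c, random_state=seed).fit(embedding)
        for c in range(1, max_clusters + 1)
    ]
    best = min(models, key=lambda m: m.bic(embedding))
    return best.predict(embedding), best.n_components
```

Calling `labels, n_clusters = cluster_sequences(list_of_dna_strings)` returns one integer label per input sequence together with the selected number of mixture components. The two-dimensional embedding is chosen only because it is easy to plot; the actual package may use a different dimension, similarity measure, or model-selection rule.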

Keywords: DNA clustering; Gaussian mixture model; Genomics; Laplacian eigenmap.

MeSH terms

Animals
Helminth Proteins / genetics*
Models, Genetic*
NADH Dehydrogenase / genetics*
Nematoda / enzymology
Nematoda / genetics*
Platyhelminths / enzymology
Platyhelminths / genetics*
Programming Languages*
Sequence Analysis, DNA / methods*

Substances

Helminth Proteins
NADH Dehydrogenase