Shared Nearest Neighbor Clustering in a Locality Sensitive Hashing Framework

Sawsan Kanj; Thomas Brüls; Stéphane Gazut

doi:10.1089/cmb.2017.0113

J Comput Biol. 2018 Feb;25(2):236-250. doi: 10.1089/cmb.2017.0113. Epub 2017 Sep 27.

Authors

Sawsan Kanj¹ ² ³ ⁴ ⁵, Thomas Brüls¹ ³ ⁴ ⁵, Stéphane Gazut²

Affiliations

¹ CEA, Genoscope, Evry, France.
² CEA, LIST, Laboratoire d'Analyse de Données et Intelligence des Systèmes, Gif-sur-Yvette, France.
³ Université d'Evry, Evry, France.
⁴ CNRS-UMR 8030, Evry, France.
⁵ Université Paris-Saclay, Evry, France.

PMID: 28953425
DOI: 10.1089/cmb.2017.0113

Abstract

We present a new algorithm to cluster high-dimensional sequence data and its application to the field of metagenomics, which aims at reconstructing individual genomes from a mixture of genomes sampled from an environmental site, without any prior knowledge of reference data (genomes) or the shape of clusters. Such problems typically cannot be solved directly with classical approaches seeking to estimate the density of clusters, for example, using the shared nearest neighbors (SNN) rule, due to the prohibitive size of contemporary sequence datasets. We explore here a new approach based on combining the SNN rule with the concept of locality sensitive hashing (LSH). The proposed method, called LSH-SNN, works by randomly splitting the input data into smaller-sized subsets (buckets) and employing the SNN rule on each of these buckets. Links can be created among neighbors sharing a sufficient number of elements, hence allowing clusters to be grown from linked elements. LSH-SNN can scale up to larger datasets consisting of millions of sequences, while achieving high accuracy across a variety of sample sizes and complexities.
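
The pipeline described in the abstract (hash into buckets, apply the SNN rule inside each bucket, link elements, grow clusters from links) can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes that sequences have already been turned into numeric feature vectors (for example k-mer frequency profiles), that hashing uses random hyperplanes, and that two elements are linked when each is among the other's k nearest neighbors and they share at least min_shared of those neighbors. The names lsh_buckets, knn_in_bucket, lsh_snn, n_planes and min_shared are hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lsh_buckets(points, n_planes, rng):
    """Split point indices into buckets by the sign pattern of random projections.

    Assumption: random-hyperplane hashing; the paper's hash family may differ.
    """
    dim = len(points[0])
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_planes)]
    buckets = defaultdict(list)
    for idx, p in enumerate(points):
        key = tuple(sum(w * x for w, x in zip(plane, p)) >= 0 for plane in planes)
        buckets[key].append(idx)
    return list(buckets.values())


def knn_in_bucket(points, bucket, k):
    """Map each index in the bucket to the set of its k nearest bucket members."""
    neighbors = {}
    for i in bucket:
        others = sorted((j for j in bucket if j != i),
                        key=lambda j: sq_dist(points[i], points[j]))
        neighbors[i] = set(others[:k])
    return neighbors


def lsh_snn(points, k=3, min_shared=2, n_planes=2, seed=0):
    """Return one cluster label per point (illustrative sketch only)."""
    rng = random.Random(seed)
    parent = list(range(len(points)))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for bucket in lsh_buckets(points, n_planes, rng):
        nn = knn_in_bucket(points, bucket, k)
        for i in bucket:
            for j in nn[i]:
                # Assumed link rule: mutual neighbors sharing enough neighbors.
                if i in nn[j] and len(nn[i] & nn[j]) >= min_shared:
                    parent[find(i)] = find(j)

    # Clusters are the connected components of the link graph.
    return [find(i) for i in range(len(points))]
```

Calling lsh_snn(vectors, k=3, min_shared=2, n_planes=4) returns a list of labels in which equal labels mark the same cluster. Because neighbor searches are restricted to buckets, the quadratic cost of the SNN step applies per bucket rather than to the whole dataset, which is the scaling idea the abstract describes.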

Keywords: density-based methods; locality sensitive hashing; metagenomic data; sequence clustering.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Cluster Analysis
Genome, Bacterial
Genomics / methods*
Metagenome*
Sequence Analysis, DNA / methods*