RabbitTClust: enabling fast clustering analysis of millions of bacteria genomes with MinHash sketches

Genome Biol. 2023 May 17;24(1):121. doi: 10.1186/s13059-023-02961-6.

Abstract

We present RabbitTClust, a fast and memory-efficient genome clustering tool based on sketch-based distance estimation. Our approach enables efficient processing of large-scale datasets by combining dimensionality reduction techniques with streaming and parallelization on modern multi-core platforms. RabbitTClust clusters 113,674 complete bacterial genome sequences from RefSeq (455 GB in FASTA format) in less than 6 min, and 1,009,738 GenBank assembled bacterial genomes (4.0 TB in FASTA format) in 34 min, on a 128-core workstation. Our results further identify 1,269 redundant genomes, with identical nucleotide content, in the RefSeq bacterial genomes database.
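
As an illustration of sketch-based distance estimation in general, the following minimal Python sketch builds bottom-s MinHash sketches from k-mers and converts the estimated Jaccard index into a Mash-style distance, D = -(1/k) ln(2J / (1 + J)). It is not RabbitTClust's implementation: the function names, the use of BLAKE2b as the hash, and the parameter defaults are assumptions chosen for illustration, canonical (strand-independent) k-mers are omitted for brevity, and the abstract does not state which distance formula the tool applies.

```python
import hashlib
import math


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash64(kmer):
    """Map a k-mer to a 64-bit integer (hash choice is an assumption)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(seq, k=21, size=1000):
    """Bottom-s MinHash sketch: the `size` smallest distinct k-mer hashes."""
    return sorted({hash64(km) for km in kmers(seq, k)})[:size]


def jaccard_estimate(a, b, size):
    """Estimate the Jaccard index from two bottom-s sketches.

    Take the `size` smallest hashes of the union and count how many of
    them occur in both sketches.
    """
    set_a, set_b = set(a), set(b)
    merged = sorted(set_a | set_b)[:size]
    if not merged:
        return 0.0
    shared = sum(1 for h in merged if h in set_a and h in set_b)
    return shared / len(merged)


def mash_distance(jaccard, k):
    """Mash-style distance from a Jaccard estimate, clamped to [0, 1]."""
    if jaccard <= 0.0:
        return 1.0
    d = -math.log(2.0 * jaccard / (1.0 + jaccard)) / k
    return min(1.0, max(0.0, d))


if __name__ == "__main__":
    # Toy sequences; real genomes would use larger k and sketch sizes.
    s1 = "ACGTTGCAAGCTTAGGCTAACGTTAGCATCGGATCGATTACG"
    s2 = "ACGTTGCAAGCTTAGGCTAACGTTAGCATCGGATCGATTTCG"
    k, size = 7, 64
    j = jaccard_estimate(sketch(s1, k, size), sketch(s2, k, size), size)
    print("Jaccard estimate:", j)
    print("Mash-style distance:", mash_distance(j, k))
```

Because each genome is reduced to a fixed-size list of hashes, comparing two genomes costs time proportional to the sketch size rather than the genome length, which is what makes all-pairs distance estimation feasible at this scale.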

Keywords: Big data; Genome clustering; MinHash sketching; Minimum spanning tree; Redundancy detection.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Bacteria
  • Cluster Analysis
  • Databases, Nucleic Acid
  • Genome*
  • Genome, Bacterial
  • Software*