Nubeam-dedup: a fast and RAM-efficient tool to de-duplicate sequencing reads without mapping

Bioinformatics. 2020 May 1;36(10):3254-3256. doi: 10.1093/bioinformatics/btaa112.

Abstract

Summary: We present Nubeam-dedup, a fast and RAM-efficient tool for de-duplicating sequencing reads without a reference genome. Nubeam-dedup represents nucleotides as matrices, transforms each read into a product of matrices, and assigns a unique number to the read based on that product. Duplicate reads can then be removed efficiently using a collisionless hash function. Compared with other state-of-the-art reference-free tools, Nubeam-dedup uses 50-70% of the CPU time and 10-15% of the RAM.
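
Illustrative sketch (not the authors' implementation): the idea of turning a read into a matrix product and using that product as an exact key can be shown in a few lines of Python. Everything specific here is an assumption made for illustration: the 2x2 matrices L and R, the two-factor encoding of each nucleotide, and the names matmul, read_key and dedup are hypothetical and are not taken from Nubeam-dedup. The abstract describes a single unique number per read, whereas this sketch keeps the four matrix entries as the key, and a Python set stands in for the collisionless hash function.

```python
from typing import Dict, Iterable, List, Tuple

# A 2x2 integer matrix [[a, b], [c, d]] stored as the tuple (a, b, c, d).
Matrix = Tuple[int, int, int, int]

# Assumed generator matrices (not the ones used by Nubeam-dedup).
# Products of L and R are distinct for distinct L/R strings, so a
# fixed two-factor code per nucleotide gives each read its own product.
L: Matrix = (1, 0, 1, 1)
R: Matrix = (1, 1, 0, 1)
IDENTITY: Matrix = (1, 0, 0, 1)


def matmul(x: Matrix, y: Matrix) -> Matrix:
    """Multiply two 2x2 integer matrices."""
    a, b, c, d = x
    e, f, g, h = y
    return (a * e + b * g, a * f + b * h, c * e + d * g, c * f + d * h)


# Hypothetical nucleotide-to-matrix table: each base is a product of two generators.
NUCLEOTIDE: Dict[str, Matrix] = {
    "A": matmul(L, L),
    "C": matmul(L, R),
    "G": matmul(R, L),
    "T": matmul(R, R),
}


def read_key(read: str) -> Matrix:
    """Return the matrix product for a read; raises KeyError on bases other than A/C/G/T."""
    product = IDENTITY
    for base in read.upper():
        product = matmul(product, NUCLEOTIDE[base])
    return product


def dedup(reads: Iterable[str]) -> List[str]:
    """Keep the first occurrence of each read, comparing reads by their matrix key."""
    seen = set()
    unique = []
    for read in reads:
        key = read_key(read)
        if key not in seen:
            seen.add(key)
            unique.append(read)
    return unique
```

For example, dedup(["ACGT", "ACGT", "TGCA"]) returns ["ACGT", "TGCA"]. Because matrix multiplication is not commutative, reads with the same base composition in a different order (such as "AC" and "CA") receive different keys. Python integers grow without overflow, which keeps the sketch exact; a fixed-width implementation would need to handle the growth of the matrix entries explicitly.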

Availability and implementation: The C++ source code and a manual are available at https://github.com/daihang16/nubeamdedup and https://haplotype.org.

Supplementary information: Supplementary data are available at Bioinformatics online.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Genome
  • High-Throughput Nucleotide Sequencing*
  • Sequence Analysis, DNA
  • Software*