KCMBT: a k-mer Counter based on Multiple Burst Trees

Abdullah-Al Mamun; Soumitra Pal; Sanguthevar Rajasekaran

doi:10.1093/bioinformatics/btw345

KCMBT: a k-mer Counter based on Multiple Burst Trees

Bioinformatics. 2016 Sep 15;32(18):2783-90. doi: 10.1093/bioinformatics/btw345. Epub 2016 Jun 9.

Authors

Abdullah-Al Mamun¹, Soumitra Pal¹, Sanguthevar Rajasekaran¹

Affiliation

¹ Department of Computer Science and Engineering, University of Connecticut, Storrs, CT 06269, USA.

Abstract

Motivation: A massive number of bioinformatics applications require counting of k-length substrings in genetically important long strings. A k-mer counter generates the frequencies of each k-length substring in genome sequences. Genome assembly, repeat detection, multiple sequence alignment, error detection and many other related applications use a k-mer counter as a building block. Very fast and efficient algorithms are necessary to count k-mers in large data sets to be useful in such applications.

Results: We propose a novel trie-based algorithm for this k-mer counting problem. We compare our devised algorithm k-mer Counter based on Multiple Burst Trees (KCMBT) with available all well-known algorithms. Our experimental results show that KCMBT is around 30% faster than the previous best-performing algorithm KMC2 for human genome dataset. As another example, our algorithm is around six times faster than Jellyfish2. Overall, KCMBT is 20-30% faster than KMC2 on five benchmark data sets when both the algorithms were run using multiple threads.

Availability and implementation: KCMBT is freely available on GitHub: (https://github.com/abdullah009/kcmbt_mt).

Contact: rajasek@engr.uconn.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

MeSH terms

Algorithms*
Base Sequence
Computational Biology / methods
Genome
Humans
Sequence Alignment*
Sequence Analysis, DNA*
Software

Grants and funding

R01 LM010101/LM/NLM NIH HHS/United States