kmcEx: memory-frugal and retrieval-efficient encoding of counted k-mers

Peng Jiang; Jie Luo; Yiqi Wang; Pingji Deng; Bertil Schmidt; Xiangjun Tang; Ningjiang Chen; Limsoon Wong; Liang Zhao

doi:10.1093/bioinformatics/btz299

Bioinformatics. 2019 Dec 1;35(23):4871-4878. doi: 10.1093/bioinformatics/btz299.

Authors

Peng Jiang¹, Jie Luo¹, Yiqi Wang¹, Pingji Deng¹, Bertil Schmidt², Xiangjun Tang¹, Ningjiang Chen³, Limsoon Wong⁴, Liang Zhao¹,³

Affiliations

¹ Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, Hubei, China.
² Institute of Computer Science, Johannes Gutenberg University Mainz, Mainz, Germany.
³ School of Computing and Electronic Information, Guangxi University, Nanning, Guangxi, China.
⁴ School of Computing, National University of Singapore, Singapore, Singapore.

PMID: 31038666
DOI: 10.1093/bioinformatics/btz299

Abstract

Motivation: K-mers and their frequencies serve as elementary building blocks for error correction, repeat detection, multiple sequence alignment, genome assembly and other analyses, which has motivated intensive study of k-mer counting. However, the output of a k-mer counter is itself large; it is often too large to fit into main memory, which severely limits its usability.

Results: We introduce a novel way of encoding k-mers together with their frequencies that achieves good memory savings and retrieval efficiency. Specifically, we propose a Bloom filter-like data structure that encodes counted k-mers in coupled bit arrays: one for k-mer representation and the other for frequency encoding. Experiments on five real datasets show that, with 7 hash functions, the average memory-saving ratio over all 31-mers is as high as 13.81 relative to the raw input. At the same time, the retrieval time complexity is well controlled (effectively constant), and the false-positive rate is decreased by two orders of magnitude.
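
To make the idea of coupled arrays concrete, the following is a minimal illustrative sketch in Python, not the kmcEx implementation. It pairs a Bloom-style presence bit array with a parallel array of one-byte counters. The class name `CoupledArrays`, the double-hashing scheme built on BLAKE2b, the max-on-insert and min-on-query rule, and the cap of 255 are all assumptions made for illustration; the actual encoding used by kmcEx is described in the paper and its source code.

```python
import hashlib


class CoupledArrays:
    """Toy sketch (hypothetical, not kmcEx): presence bits plus parallel counters."""

    MAX_COUNT = 255  # assumption: one byte per counter slot

    def __init__(self, num_slots, num_hashes=7):
        self.num_slots = num_slots
        self.num_hashes = num_hashes
        self.presence = bytearray((num_slots + 7) // 8)  # 1 bit per slot
        self.counts = bytearray(num_slots)               # 1 byte per slot

    def _slots(self, kmer):
        # Derive num_hashes slot indices by double hashing one BLAKE2b digest.
        digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:], "little") | 1  # force an odd stride
        return [(h1 + i * h2) % self.num_slots for i in range(self.num_hashes)]

    def add(self, kmer, count):
        count = min(count, self.MAX_COUNT)  # saturate instead of overflowing
        for s in self._slots(kmer):
            self.presence[s >> 3] |= 1 << (s & 7)
            # Keep the maximum so a colliding k-mer never lowers a stored count.
            if self.counts[s] < count:
                self.counts[s] = count

    def get(self, kmer):
        slots = self._slots(kmer)
        # Bloom-style membership test: every hashed bit must be set.
        if not all((self.presence[s >> 3] >> (s & 7)) & 1 for s in slots):
            return 0
        # The smallest counter is the least inflated by collisions.
        return min(self.counts[s] for s in slots)


table = CoupledArrays(num_slots=1 << 16)
table.add("ACGTACGTACGTACGTACGTACGTACGTACG", 5)   # a 31-mer seen 5 times
print(table.get("ACGTACGTACGTACGTACGTACGTACGTACG"))
```

In this sketch a lookup touches a fixed number of slots (here 7), so retrieval cost does not grow with the number of stored k-mers. A returned count can never be lower than the capped count that was inserted, but collisions can inflate it, and an absent k-mer can be reported as present, as in any Bloom filter.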

Availability and implementation: The source code of our algorithm is available at github.com/lzhLab/kmcEx.

Supplementary information: Supplementary data are available at Bioinformatics online.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms*
Sequence Alignment
Sequence Analysis, DNA
Software*