StLiter: A Novel Algorithm to Iteratively Build the Compacted de Bruijn Graph From Many Complete Genomes

IEEE/ACM Trans Comput Biol Bioinform. 2022 Jul-Aug;19(4):2471-2483. doi: 10.1109/TCBB.2021.3062068. Epub 2022 Aug 8.

Abstract

Recently, the compacted de Bruijn graph (cDBG) of complete genome sequences was successfully used in read mapping due to its ability to deal with the repetitions in genomes. However, current approaches are not flexible enough to fit frequently building the graphs with different k-mer lengths. Instead of building the graph directly, how can we build the compacted de Bruijin graph of longer k-mer based on the one of short k-mer? In this article, we present StLiter, a novel algorithm to build the compacted de Bruijn graph either directly from genome sequences or indirectly based on the graph of a short k-mer. For 100 simulated human genomes, StLiter can construct the graph of k-mer length 15-18 in 2.5-3.2 hours with maximal ∼70GB memory in the case of without considering the reverese complements of the reference genomes. And it costs 4.5-5.9 hours when considering the reverse complements. In experiments, we compared StLiter with TwoPaCo, the state-of-art method for building the graph, on 4 datasets. For k-mer length 15-18, StLiter can build the graph 5-9 times faster than TwoPaCo using less maximal memory cost. For k-mer length larger than 18, given the graph of a short (k- x)-mer, such as x= 1-2, compared with TwoPaCo building the graph directly, StLiter can also build the graph more efficiently. The source codes of StLiter can be downloaded from web site https://github.com/BioLab-cz/StLiter.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Genome, Human / genetics
  • High-Throughput Nucleotide Sequencing* / methods
  • Humans
  • Sequence Analysis, DNA / methods
  • Software*