FSG: Fast String Graph Construction for De Novo Assembly

J Comput Biol. 2017 Oct;24(10):953-968. doi: 10.1089/cmb.2017.0089. Epub 2017 Jul 17.

Abstract

The string graph for a collection of next-generation reads is a lossless data representation that is fundamental for de novo assemblers based on the overlap-layout-consensus paradigm. In this article, we explore a novel approach to compute the string graph, based on the FM-index and Burrows and Wheeler Transform. We describe a simple algorithm that uses only the FM-index representation of the collection of reads to construct the string graph, without accessing the input reads. Our algorithm has been integrated into the string graph assembler (SGA) as a standalone module to construct the string graph. The new integrated assembler has been assessed on a standard benchmark, showing that fast string graph (FSG) is significantly faster than SGA while maintaining a moderate use of main memory, and showing practical advantages in running FSG on multiple threads. Moreover, we have studied the effect of coverage rates on the running times.

Keywords: Burrows and Wheeler Transform; genome assembly; string graph.

MeSH terms

  • Algorithms*
  • Computational Biology / methods*
  • Genome, Human
  • Genomics / methods*
  • High-Throughput Nucleotide Sequencing / methods*
  • Humans
  • Sequence Analysis, DNA / methods*