Using Apache Spark on genome assembly for scalable overlap-graph reduction

Hum Genomics. 2019 Oct 22;13(Suppl 1):48. doi: 10.1186/s40246-019-0227-1.

Abstract

Background: De novo genome assembly is a technique that builds the genome of a specimen using overlaps of genomic fragments without additional work with reference sequence. Sequence fragments (called reads) are assembled as contigs and scaffolds by the overlaps. The quality of the de novo assembly depends on the length and continuity of the assembly. To enable faster and more accurate assembly of species, existing sequencing techniques have been proposed, for example, high-throughput next-generation sequencing and long-reads-producing third-generation sequencing. However, these techniques require a large amounts of computer memory when very huge-size overlap graphs are resolved. Also, it is challenging for parallel computation.

Results: To address the limitations, we propose an innovative algorithmic approach, called Scalable Overlap-graph Reduction Algorithms (SORA). SORA is an algorithm package that performs string graph reduction algorithms by Apache Spark. The SORA's implementations are designed to execute de novo genome assembly on either a single machine or a distributed computing platform. SORA efficiently compacts the number of edges on enormous graphing paths by adapting scalable features of graph processing libraries provided by Apache Spark, GraphX and GraphFrames.

Conclusions: We shared the algorithms and the experimental results at our project website, https://github.com/BioHPC/SORA . We evaluated SORA with the human genome samples. First, it processed a nearly one billion edge graph on a distributed cloud cluster. Second, it processed mid-to-small size graphs on a single workstation within a short time frame. Overall, SORA achieved the linear-scaling simulations for the increased computing instances.

Keywords: Apache spark; Cloud computing; Genome assembly; Graph reduction; Overlap-layout-consensus.

Publication types

  • Research Support, N.I.H., Extramural
  • Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

  • Algorithms*
  • Base Sequence
  • Conyza / genetics
  • Databases, Genetic
  • Genome*
  • Genome, Human
  • Genome, Plant
  • Humans
  • Sequence Analysis, DNA*