PipeMEM: A Framework to Speed Up BWA-MEM in Spark with Low Overhead

Genes (Basel). 2019 Nov 4;10(11):886. doi: 10.3390/genes10110886.

Abstract

(1) Background: DNA sequence alignment is an essential step in genome analysis. BWA-MEM has been a prevalent single-node tool for genome alignment because of its high speed and accuracy. However, genome data are being generated at an exponentially growing rate, and handling such volumes requires a multi-node solution, which remains a challenge. Spark is a widely used big data platform that has been exploited to help genome alignment meet this challenge. Nonetheless, existing works that use Spark to optimize BWA-MEM suffer from high overhead. (2) Methods: In this paper, we presented PipeMEM, a framework that accelerates BWA-MEM with low overhead by means of the pipe operation in Spark. We additionally proposed a pipeline structure and in-memory computation to accelerate PipeMEM further. (3) Results: Our experiments showed that, on paired-end alignment tasks, our framework had low overhead. In a multi-node environment, our framework was, on average, 2.27× faster than BWASpark (an alignment tool in the Genome Analysis Toolkit (GATK)) and 2.33× faster than SparkBWA. (4) Conclusions: PipeMEM could accelerate BWA-MEM in the Spark environment with high performance and low overhead.
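
The pipe operation mentioned in the Methods can be illustrated with a minimal standalone sketch. It is not the PipeMEM implementation and does not use Spark itself. It only mimics the general idea behind a pipe step: each partition's records are written line by line to the standard input of an external program, and that program's standard output lines become the resulting records. The helper names `pipe_partition` and `pipe_all` are hypothetical, and a small Python filter that upper-cases its input stands in for the external aligner, so nothing here reflects BWA-MEM's actual command line or output format.

```python
import subprocess
import sys


def pipe_partition(records, command):
    """Stream one partition's records through an external command.

    Each record is written as one line on the command's stdin; each line
    the command writes to stdout becomes one output record.
    """
    payload = "".join(record + "\n" for record in records)
    result = subprocess.run(
        command, input=payload, capture_output=True, text=True, check=True
    )
    return result.stdout.splitlines()


def pipe_all(partitions, command):
    """Apply the same external command to every partition independently."""
    return [pipe_partition(partition, command) for partition in partitions]


# Assumption: a stand-in for the external aligner. It simply upper-cases
# each input line; a real aligner would emit alignment records instead.
STAND_IN_ALIGNER = [
    sys.executable,
    "-c",
    "import sys\nfor line in sys.stdin: sys.stdout.write(line.upper())",
]

# Two hypothetical partitions of read sequences.
partitions = [["acgt", "ttga"], ["ggca"]]
aligned = pipe_all(partitions, STAND_IN_ALIGNER)
print(aligned)
```

Because the external program runs unmodified and only communicates through its standard streams, a wrapper of this kind needs no changes to the tool's source code, which is one plausible reason a pipe-based design can keep overhead low.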

Keywords: BWA-MEM; Spark; low overhead.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Big Data
  • Chromosome Mapping
  • Genome, Human
  • High-Throughput Nucleotide Sequencing / methods*
  • Humans
  • Sequence Alignment / methods*
  • Sequence Analysis, DNA / methods*
  • Software