Evaluation of serverless computing for scalable execution of a joint variant calling workflow

PLoS One. 2021 Jul 9;16(7):e0254363. doi: 10.1371/journal.pone.0254363. eCollection 2021.

Abstract

Advances in whole-genome sequencing have greatly reduced the cost and time of obtaining raw genetic information, but the computational requirements of analysis remain a challenge. Serverless computing has emerged as an alternative to using dedicated compute resources, but its utility has not been widely evaluated for standardized genomic workflows. In this study, we define and execute a best-practice joint variant calling workflow using the SWEEP workflow management system. We present an analysis of performance and scalability, and discuss the utility of the serverless paradigm for executing workflows in the field of genomics research. The GATK best-practice short germline joint variant calling pipeline was implemented as a SWEEP workflow comprising 18 tasks. The workflow was executed on Illumina paired-end read samples from the European and African super populations of the 1000 Genomes project phase III. Cost and runtime increased linearly with increasing sample size, although runtime was driven primarily by a single task for larger problem sizes. Execution took a minimum of around 3 hours for 2 samples, up to nearly 13 hours for 62 samples, with costs ranging from $2 to $70.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Databases, Genetic
  • Genomics / methods
  • High-Throughput Nucleotide Sequencing / methods
  • Humans
  • Software
  • Workflow*

Grants and funding

DotMote Labs provided support in the form of contractor salaries to Kathleen Muenzen. The specific role of this author is articulated in the ‘author contributions’ section. The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.