Reproducibly sampling SARS-CoV-2 genomes across time, geography, and viral diversity

F1000Res. 2020 Jun 29:9:657. doi: 10.12688/f1000research.24751.2. eCollection 2020.

Abstract

The COVID-19 pandemic has led to a rapid accumulation of SARS-CoV-2 genomes, enabling genomic epidemiology on local and global scales. Collections of genomes from resources such as GISAID must be subsampled to enable computationally feasible phylogenetic and other analyses. We present genome-sampler, a software package that supports sampling collections of viral genomes across multiple axes including time of genome isolation, location of genome isolation, and viral diversity. The software is modular in design so that these or future sampling approaches can be applied independently and combined (or replaced with a random sampling approach) to facilitate custom workflows and benchmarking. genome-sampler is written as a QIIME 2 plugin, ensuring that its application is fully reproducible through QIIME 2's unique retrospective data provenance tracking system. genome-sampler can be installed in a conda environment on macOS or Linux systems. A complete default pipeline is available through a Snakemake workflow, so subsampling can be achieved using a single command. genome-sampler is open source, free for all to use, and available at https://caporasolab.us/genome-sampler. We hope that this will facilitate SARS-CoV-2 research and support evaluation of viral genome sampling approaches for genomic epidemiology.
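The idea of sampling independently along several axes and then combining the results can be illustrated with a minimal sketch. This is not genome-sampler's implementation or API. The record fields (`id`, `date`, `location`), the helper names `sample_by_group` and `week_bin`, the seven-day bin width, and the toy records are all assumptions made for illustration. The sketch covers only the time and location axes and uses a fixed random seed so that repeated runs select the same genomes. Sampling across viral diversity, which requires sequence comparison, is not shown.

```python
import random
from collections import defaultdict
from datetime import date


def sample_by_group(records, key, n_per_group, seed=0):
    """Pick up to n_per_group record IDs from each group defined by key(record)."""
    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    groups = defaultdict(list)
    for rec in records:
        groups[key(rec)].append(rec["id"])
    selected = set()
    for group_key in sorted(groups):  # stable iteration order across runs
        ids = sorted(groups[group_key])
        k = min(n_per_group, len(ids))
        selected.update(rng.sample(ids, k))
    return selected


def week_bin(rec, start=date(2020, 1, 1)):
    """Assign a record to a seven-day window counted from a hypothetical start date."""
    return (rec["date"] - start).days // 7


# Hypothetical metadata records; real metadata would come from a resource such as GISAID.
records = [
    {"id": "g1", "date": date(2020, 1, 5), "location": "Arizona"},
    {"id": "g2", "date": date(2020, 1, 6), "location": "Arizona"},
    {"id": "g3", "date": date(2020, 2, 1), "location": "Washington"},
    {"id": "g4", "date": date(2020, 3, 15), "location": "Arizona"},
]

by_time = sample_by_group(records, week_bin, n_per_group=1)
by_place = sample_by_group(records, lambda r: r["location"], n_per_group=1)

# Each axis is sampled independently; the final selection is the union.
combined = by_time | by_place
```

Because each axis is handled by a separate call, one axis can be dropped, or replaced with a purely random draw of the same size, to compare sampling strategies.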

Keywords: QIIME 2; SARS-CoV-2; bioinformatics; genome-sampler; genomics.

MeSH terms

  • COVID-19
  • Computational Biology
  • Genome, Viral*
  • Geography
  • Humans
  • Pandemics
  • Phylogeny*
  • Retrospective Studies
  • SARS-CoV-2 / genetics*
  • Software

Grants and funding

Our software development and documentation work were funded by a Chan-Zuckerberg Initiative Essential Open Source Software grant to JGC; an Alfred P. Sloan Foundation grant to JGC, CH, and JS; and the National Cancer Institute of the National Institutes of Health under the awards for the Partnership of Native American Cancer Prevention U54CA143924 (UACC) and U54CA143925 (NAU) to JGC. Initial development of the QIIME 2 platform was funded in part by the National Science Foundation grant 1565100 to JGC. Testing and initial application of this software were performed on Northern Arizona University’s Monsoon computing cluster, funded by Arizona’s Technology and Research Initiative Fund. Additional analysis effort was funded under the State of Arizona Technology and Research Initiative Fund (TRIF), administered by the Arizona Board of Regents, through Northern Arizona University.