A Fast, Reproducible, High-throughput Variant Calling Workflow for Population Genomics

Mol Biol Evol. 2024 Jan 3;41(1):msad270. doi: 10.1093/molbev/msad270.

Abstract

The increasing availability of genomic resequencing data sets and high-quality reference genomes across the tree of life presents exciting opportunities for comparative population genomic studies. However, substantial challenges prevent the simple reuse of data across different studies and species, arising from variability in variant calling pipelines, data quality, and the need for computationally intensive reanalysis. Here, we present snpArcher, a flexible and highly efficient workflow designed for the analysis of genomic resequencing data in nonmodel organisms. snpArcher provides a standardized variant calling pipeline and includes modules for variant quality control, data visualization, variant filtering, and other downstream analyses. Implemented in Snakemake, snpArcher is user-friendly, reproducible, and designed to be compatible with high-performance computing clusters and cloud environments. To demonstrate the flexibility of this pipeline, we applied snpArcher to 26 public resequencing data sets from nonmammalian vertebrates. These variant data sets are hosted publicly to enable future comparative population genomic analyses. With its extensibility and the availability of public data sets, snpArcher will contribute to a broader understanding of genetic variation across species by facilitating the rapid use and reuse of large genomic data sets.

Keywords: comparative population genomics; conservation genomics; evolutionary genomics; genomic workflow.

MeSH terms

  • Animals
  • Genomics
  • High-Throughput Nucleotide Sequencing
  • Metagenomics*
  • Sequence Analysis, DNA
  • Software*
  • Workflow