ddrage: A data set generator to evaluate ddRADseq analysis software

Mol Ecol Resour. 2018 May;18(3):681-690. doi: 10.1111/1755-0998.12743. Epub 2017 Dec 27.

Abstract

High-throughput sequencing makes it possible to evaluate thousands of genetic markers across genomes and populations. Reduced-representation sequencing approaches, like double-digest restriction site-associated DNA sequencing (ddRADseq), are frequently applied to screen for genetic variation. Particularly in nonmodel organisms, where whole-genome sequencing is not yet feasible, ddRADseq has become popular because it allows genomewide assessment of variation patterns even in the absence of other genomic resources. However, while many tools are available for the analysis of ddRADseq data, few options exist to simulate ddRADseq data in order to evaluate the accuracy of downstream tools. The available tools either focus on the optimization of ddRAD experiment design or do not provide the information necessary for a detailed evaluation of different ddRAD analysis tools. Such an evaluation requires a ground truth, that is, the underlying information on all effects in the data set. We therefore present ddrage, the ddRAD Data Set Generator, which allows both developers and users to evaluate their ddRAD analysis software. ddrage lets the user adjust many parameters, such as coverage and the rates of mutations, sequencing errors or allelic dropouts, in order to generate a realistic simulated ddRADseq data set for given experimental scenarios and organisms. The simulated reads can be processed with available analysis software such as stacks or pyrad, and the results can be evaluated against the underlying parameters used to generate the data to gauge the impact of different parameter values used during downstream data processing.
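
To make the notion of a ground truth concrete, the following is a minimal toy sketch in Python. It is not ddrage's implementation or interface: the function simulate_locus, its parameters and the structure of the returned record are all hypothetical, and the model (one locus, one diploid individual, independent per-base substitutions) is a deliberate simplification. It only illustrates how coverage, mutation rate, sequencing error rate and allelic dropout can be simulated while recording every effect for later comparison with an analysis tool's output.

```python
import random

BASES = "ACGT"


def substitute(rng, base):
    """Return a random base different from `base`."""
    return rng.choice([b for b in BASES if b != base])


def simulate_locus(rng, length=50, coverage=10, mutation_rate=0.01,
                   error_rate=0.001, dropout_rate=0.05):
    """Toy model (hypothetical, not ddrage): simulate reads for one locus of
    one diploid individual and return them together with the ground truth."""
    reference = "".join(rng.choice(BASES) for _ in range(length))
    truth = {"reference": reference, "mutations": [], "dropped_alleles": []}
    alleles = []

    for allele_idx in range(2):
        seq = list(reference)
        # Introduce substitutions and record each one in the ground truth.
        for pos in range(length):
            if rng.random() < mutation_rate:
                seq[pos] = substitute(rng, reference[pos])
                truth["mutations"].append((allele_idx, pos, seq[pos]))
        # A dropped allele produces no reads at all.
        if rng.random() < dropout_rate:
            truth["dropped_alleles"].append(allele_idx)
        else:
            alleles.append("".join(seq))

    reads = []
    if alleles:
        for _ in range(coverage):
            template = rng.choice(alleles)
            # Sequencing errors affect single reads and are not mutations.
            read = "".join(
                substitute(rng, base) if rng.random() < error_rate else base
                for base in template
            )
            reads.append(read)

    return reads, truth


if __name__ == "__main__":
    reads, truth = simulate_locus(random.Random(42))
    print(len(reads), "reads;", len(truth["mutations"]), "true mutations")
```

Because the record in truth lists every simulated mutation and dropout, the variants reported by an analysis pipeline run on the reads could be compared against it directly, which is the kind of evaluation the abstract describes.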

Keywords: RAD; coverage; dropout; evaluation; in silico simulation; mathematical modelling.

MeSH terms

  • Computer Simulation
  • Datasets as Topic*
  • Models, Genetic
  • Models, Theoretical
  • Sequence Analysis / methods*
  • Software*