Performance evaluation of six popular short-read simulators

Mark Milhaven; Susanne P Pfeifer

doi:10.1038/s41437-022-00577-3

Heredity (Edinb). 2023 Feb;130(2):55-63. doi: 10.1038/s41437-022-00577-3. Epub 2022 Dec 10.

Authors

Mark Milhaven^{1,2}, Susanne P Pfeifer^{3,4}

Affiliations

¹ School of Life Sciences, Arizona State University, Tempe, AZ, 85281, USA.
² Center for Evolution and Medicine, Arizona State University, Tempe, AZ, 85281, USA.
³ School of Life Sciences, Arizona State University, Tempe, AZ, 85281, USA. susanne.pfeifer@asu.edu.
⁴ Center for Evolution and Medicine, Arizona State University, Tempe, AZ, 85281, USA. susanne.pfeifer@asu.edu.

Abstract

High-throughput sequencing data enables the comprehensive study of genomes and the variation therein. Essential for the interpretation of this genomic data is a thorough understanding of the computational methods used for processing and analysis. Whereas "gold-standard" empirical datasets exist for this purpose in humans, synthetic (i.e., simulated) sequencing data can offer important insights into the capabilities and limitations of computational pipelines for any arbitrary species and/or study design. Yet the ability of read simulator software to emulate genomic characteristics of empirical datasets remains poorly understood. We here compare the performance of six popular short-read simulators (ART, DWGSIM, InSilicoSeq, Mason, NEAT, and wgsim) and discuss important considerations for selecting suitable models for benchmarking.

Publication types

Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

Benchmarking
Genome
Genomics* / methods
High-Throughput Nucleotide Sequencing / methods
Humans
Software*