sim1000G: a user-friendly genetic variant simulator in R for unrelated individuals and family-based designs

BMC Bioinformatics. 2019 Jan 15;20(1):26. doi: 10.1186/s12859-019-2611-1.

Abstract

Background: Simulation of genetic variant data is frequently required to evaluate statistical methods in human and animal genetics. Although a number of high-quality genetic simulators have been developed, many of them require advanced knowledge of population genetics or of computing to be used effectively. In addition, generating simulated data for family-based studies demands sophisticated methods and advanced computer programming.

Results: To address these issues, we propose a new user-friendly and integrated R package, sim1000G, which simulates variants in genomic regions among unrelated individuals or among families. The only input needed is a raw phased Variant Call Format (VCF) file. Haplotypes are extracted from this file to compute linkage disequilibrium (LD) in the simulated genomic regions and to generate new genotype data among unrelated individuals. The covariance across variants is used to preserve the LD structure of the original population. Pedigrees of arbitrary size are generated by modeling recombination events. To illustrate the application of sim1000G, various scenarios are presented, assuming unrelated individuals from a single population or from two distinct populations, or alternatively three-generation pedigree data. Sim1000G can capture allele frequency diversity, short- and long-range LD patterns and subtle population differences in LD structure without requiring any tuning parameters.
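
As a rough illustration of the LD computation mentioned above, the following sketch computes the standard pairwise r² statistic between two biallelic variants from phased haplotypes. It is written in Python for brevity and is not sim1000G's own code or API: the function name `ld_r2` and the toy haplotype matrix are hypothetical, and the package itself works from a VCF file in R.

```python
def ld_r2(haplotypes, i, j):
    """Pairwise LD (r^2) between variants i and j.

    `haplotypes` is a list of phased haplotypes, each a list of 0/1 alleles
    (0 = reference, 1 = alternate). Hypothetical helper for illustration only.
    """
    n = len(haplotypes)
    if n == 0:
        raise ValueError("no haplotypes supplied")

    # Alternate-allele frequencies at each variant and the joint frequency.
    p_a = sum(h[i] for h in haplotypes) / n
    p_b = sum(h[j] for h in haplotypes) / n
    p_ab = sum(h[i] * h[j] for h in haplotypes) / n

    # D is the covariance between the two allele indicators.
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when a variant is monomorphic")
    return d * d / denom


# Toy data (assumed, not from the 1000 Genomes Project):
# four haplotypes, three variants.
haps = [
    [0, 0, 0],
    [0, 0, 1],
    [1, 1, 0],
    [1, 1, 1],
]

r2_linked = ld_r2(haps, 0, 1)    # variants 0 and 1 always co-occur
r2_unlinked = ld_r2(haps, 0, 2)  # variants 0 and 2 are independent
print(r2_linked, r2_unlinked)
```

In this toy matrix the first two variants are perfectly correlated and the first and third are uncorrelated, which is the kind of pairwise structure a simulator must reproduce to preserve the LD pattern of the source population.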

Conclusion: Sim1000G fills a gap in the vast field of genetic variant simulators through its simplicity and independence from external tools. Currently, it is one of the few simulation packages completely integrated into R and able to simulate multiple genetic variants among unrelated individuals and within families. Its implementation will facilitate the application and development of computational methods for association studies with both rare and common variants.

Keywords: 1000 genomes; Linkage disequilibrium; NGS; Pedigree data; Sequencing; Simulation.

MeSH terms

  • Computational Biology / methods*
  • Female
  • Genetic Linkage*
  • Genetic Markers*
  • Genetics, Population*
  • Humans
  • Linkage Disequilibrium
  • Male
  • Models, Genetic*
  • Pedigree
  • Polymorphism, Single Nucleotide*
  • Software*

Substances

  • Genetic Markers