Fast Ordered Sampling of DNA Sequence Variants

Anthony J Greenberg

doi:10.1534/g3.117.300465

G3 (Bethesda). 2018 May 4;8(5):1455-1460. doi: 10.1534/g3.117.300465.

Author

Anthony J Greenberg¹

Affiliation

¹ Bayesic Research, Ithaca, NY 14850. tonyg@bayesicresearch.org

Abstract

Explosive growth in the amount of genomic data is matched by increasing power of consumer-grade computers. Even applications that require powerful servers can be quickly tested on desktop or laptop machines if we can generate representative samples from large data sets. I describe a fast and memory-efficient implementation of an on-line sampling method developed for tape drives 30 years ago. Focusing on genotype files, I test the performance of this technique on modern solid-state and spinning hard drives, and show that it performs well compared to a simple sampling scheme. I illustrate its utility by developing a method to quickly estimate genome-wide patterns of linkage disequilibrium (LD) decay with distance. I provide open-source software that samples loci from several variant format files, a separate program that performs LD decay estimates, and a C++ library that lets developers incorporate these methods into their own projects.
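To make the idea of ordered sampling concrete, here is a minimal sketch of drawing a fixed-size random sample of locus indices in a single forward pass, so that the selected records come out already sorted by file position. The abstract does not name the algorithm used, so this sketch substitutes classic selection sampling (Knuth's Algorithm S), which is simpler than skip-based methods and examines every record. The function name `orderedSample` and its parameters are hypothetical and are not taken from the author's library.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper (not part of the author's library).
// Returns `n` distinct indices from [0, total) in increasing order,
// using selection sampling: record i is kept with probability
// (still needed) / (records remaining).
std::vector<std::size_t> orderedSample(std::size_t total, std::size_t n,
                                       std::mt19937_64 &rng) {
    assert(n <= total);  // cannot sample more records than exist
    std::vector<std::size_t> picked;
    picked.reserve(n);
    for (std::size_t i = 0; i < total && picked.size() < n; ++i) {
        const std::size_t remaining = total - i;         // records not yet visited
        const std::size_t needed    = n - picked.size(); // samples still to draw
        // Integer draw in [0, remaining - 1]; keeping the record when the
        // draw is below `needed` gives probability needed / remaining.
        std::uniform_int_distribution<std::size_t> draw(0, remaining - 1);
        if (draw(rng) < needed) {
            picked.push_back(i);
        }
    }
    return picked;
}
```

Because the indices are produced in increasing order, a reader of a genotype file could seek or skip forward to each selected locus without ever moving backward. A call such as `orderedSample(1000000, 5000, rng)` with a seeded `std::mt19937_64` would return 5000 sorted locus indices.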

Keywords: C++; genomics; nucleotide polymorphism; random sampling; statistical genetics.

MeSH terms

Animals
Base Sequence
Drosophila / genetics
Genetic Loci
Genetic Variation*
Linkage Disequilibrium / genetics
Oryza / genetics
Polymorphism, Single Nucleotide / genetics
Time Factors