Confounded by sequencing depth in association studies of rare alleles

Genet Epidemiol. 2011 May;35(4):261-8. doi: 10.1002/gepi.20574.

Abstract

Next-generation DNA sequencing technologies are facilitating large-scale association studies of rare genetic variants. The depth of sequence read coverage is an important experimental variable in next-generation technologies, and it is a major determinant of the quality of genotype calls generated from sequence data. When case and control samples are sequenced separately or in different proportions across batches, they are unlikely to be matched on sequencing read depth, and differential misclassification of genotypes can result, causing confounding and an increased false-positive rate. Data from Pilot Study 3 of the 1000 Genomes Project were used to demonstrate that a difference between the mean sequencing read depth of case and control samples can result in false-positive association for rare and uncommon variants, even when the mean coverage depth exceeds 30× in both groups. The degree of confounding and the inflation in the false-positive rate depended on the extent to which the mean depth differed between the case and control groups. A logistic regression model was used to test for association between case-control status and the cumulative number of alleles in a collapsed set of rare and uncommon variants. Including each individual's mean sequence read depth across the variant sites in the logistic regression model nearly eliminated the confounding effect and the inflated false-positive rate. Furthermore, accounting for potential genotyping error by modeling the probability of the heterozygote genotype calls in the regression analysis had a relatively minor but beneficial effect on the statistical results.
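
The depth adjustment described above can be illustrated with a small simulation. The following is a minimal sketch, not the authors' analysis: it uses synthetic data rather than the 1000 Genomes data, and the error model (a rare allele is called with probability depth/60), the depth distributions, the sample sizes, and all function names are assumptions chosen only to make the effect visible. There is no true association in the simulated data, so any signal in the unadjusted model comes from the depth difference alone.

```python
import math
import random


def invert(a):
    """Invert a small square matrix by Gauss-Jordan elimination."""
    n = len(a)
    m = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        d = m[c][c]
        m[c] = [v / d for v in m[c]]
        for r in range(n):
            if r != c:
                f = m[r][c]
                m[r] = [v - f * u for v, u in zip(m[r], m[c])]
    return [row[n:] for row in m]


def fit_logistic(X, y, iters=25):
    """Newton-Raphson logistic regression; returns (coefficients, std. errors)."""
    p = len(X[0])
    beta = [0.0] * p
    for _ in range(iters):
        grad = [0.0] * p
        info = [[0.0] * p for _ in range(p)]
        for xi, yi in zip(X, y):
            eta = sum(b * x for b, x in zip(beta, xi))
            mu = 1.0 / (1.0 + math.exp(-eta))
            w = mu * (1.0 - mu)
            for j in range(p):
                grad[j] += (yi - mu) * xi[j]
                for k in range(p):
                    info[j][k] += w * xi[j] * xi[k]
        cov = invert(info)  # inverse of the observed information matrix
        beta = [b + sum(cov[j][k] * grad[k] for k in range(p))
                for j, b in enumerate(beta)]
    return beta, [math.sqrt(cov[j][j]) for j in range(p)]


def simulate(n_per_group=2000, n_sites=40, carrier_freq=0.05, seed=1):
    """Synthetic null data: true burden is unrelated to case-control status."""
    rng = random.Random(seed)
    rows = []
    # Assumption: cases are sequenced at lower mean depth than controls.
    for status, mean_depth in ((1, 35.0), (0, 45.0)):
        for _ in range(n_per_group):
            depth = rng.gauss(mean_depth, 8.0)
            # Hypothetical error model: each true rare allele is called
            # with probability depth/60, clipped to [0, 1].
            detect = min(1.0, max(0.0, depth / 60.0))
            true_count = sum(rng.random() < carrier_freq for _ in range(n_sites))
            called = sum(rng.random() < detect for _ in range(true_count))
            rows.append((status, called, depth))
    return rows


def burden_z(rows, adjust_for_depth):
    """Wald z statistic for the collapsed allele count (burden) coefficient."""
    y = [r[0] for r in rows]
    X = []
    for _, called, depth in rows:
        x = [1.0, float(called)]
        if adjust_for_depth:
            x.append((depth - 40.0) / 10.0)  # centered, scaled read depth
        X.append(x)
    beta, se = fit_logistic(X, y)
    return beta[1] / se[1]


rows = simulate()
z_unadjusted = burden_z(rows, adjust_for_depth=False)
z_adjusted = burden_z(rows, adjust_for_depth=True)
print(f"burden z without depth covariate: {z_unadjusted:.2f}")
print(f"burden z with depth covariate:    {z_adjusted:.2f}")
```

Under these assumptions the unadjusted burden statistic is strongly inflated, while adding the per-individual depth covariate brings it back toward the null. The sketch does not model the probability of heterozygote calls, which the abstract reports as a further, smaller refinement.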

Publication types

  • Research Support, N.I.H., Extramural

MeSH terms

  • Case-Control Studies
  • Confounding Factors, Epidemiologic
  • Gene Frequency
  • Genetic Variation*
  • Genome, Human
  • Genome-Wide Association Study*
  • Genotype
  • Heterozygote
  • High-Throughput Nucleotide Sequencing*
  • Humans
  • Models, Genetic
  • Pilot Projects
  • Polymorphism, Single Nucleotide
  • Probability
  • Regression Analysis
  • Sequence Analysis, DNA*