A new expectation-maximization statistical test for case-control association studies considering rare variants obtained by high-throughput sequencing

Derek Gordon; Stephen J Finch; Francisco M De La Vega

doi:10.1159/000325590

Hum Hered. 2011;71(2):113-25. doi: 10.1159/000325590. Epub 2011 Jul 6.

Authors

Derek Gordon¹, Stephen J Finch, Francisco M De La Vega

Affiliation

¹ Department of Genetics, Rutgers University, Piscataway, N.J., USA.

PMID: 21734402

Abstract

Genome-wide association studies (GWAS) have been successful in identifying common genetic variation reproducibly associated with disease. However, most associated variants confer very small risk, and even after meta-analysis of large cohorts a large fraction of the expected heritability remains unexplained. A possible explanation is that rare variants, currently undetected by GWAS with SNP arrays, could contribute a large fraction of risk when present in cases. This concept has spurred great interest in exploring the role of rare variants in disease. As the cost of sequencing continues to plummet, it is becoming feasible to sequence case-control samples directly and to test for disease association with rare variants as well as common ones. We have developed a test statistic that allows for association testing among cases and controls using data directly from sequencing reads. In addition, our method allows for random errors in reads. We determine the probability of a true genotype call based on the observed base pair reads using the expectation-maximization algorithm. We apply the SumStat procedure to obtain a single statistic for a group of multiple rare variant loci. We document the validity of our method through simulations. Our results suggest that our statistic maintains the correct type I error rate, even in the presence of differential misclassification for sequence reads, and that it has good power under a number of scenarios. Finally, our SumStat results show power at least as good as the maximum single locus results.
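
To make the genotype-probability step concrete, the following is a minimal sketch of how an expectation-maximization loop can turn per-individual read counts at one biallelic locus into posterior genotype probabilities. It is not the authors' implementation: the binomial read model, the fixed and known per-read error rate `err`, the uniform starting frequencies, and the function names `read_likelihood` and `em_genotype_posteriors` are all assumptions made for illustration.

```python
from math import comb


def read_likelihood(k, n, g, err):
    """Binomial probability of seeing k variant reads out of n,
    given g copies (0, 1, 2) of the variant allele.

    Assumed model: each read shows the wrong allele with probability err.
    """
    p = (g / 2) * (1 - err) + (1 - g / 2) * err
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def em_genotype_posteriors(reads, err=0.01, n_iter=200, tol=1e-10):
    """Estimate genotype frequencies and per-individual posteriors.

    reads: list of (variant_read_count, total_read_count) pairs,
           one pair per individual.
    Returns (freqs, post), where freqs[g] is the estimated frequency of
    genotype g and post[i][g] is the posterior for individual i.
    """
    freqs = [1 / 3, 1 / 3, 1 / 3]  # assumed uniform starting point
    post = []
    for _ in range(n_iter):
        # E-step: posterior genotype probabilities under current freqs.
        post = []
        for k, n in reads:
            w = [freqs[g] * read_likelihood(k, n, g, err) for g in range(3)]
            total = sum(w)
            post.append([x / total for x in w])
        # M-step: new frequencies are the average posterior weights.
        new_freqs = [sum(p[g] for p in post) / len(post) for g in range(3)]
        converged = max(abs(a - b) for a, b in zip(freqs, new_freqs)) < tol
        freqs = new_freqs
        if converged:
            break
    return freqs, post


if __name__ == "__main__":
    # Hypothetical read counts: (variant reads, total reads) per individual.
    example_reads = [(0, 20), (0, 18), (10, 20), (0, 25), (19, 20)]
    freqs, post = em_genotype_posteriors(example_reads)
    print("estimated genotype frequencies:", freqs)
    for (k, n), p in zip(example_reads, post):
        print(f"{k}/{n} variant reads -> posteriors {p}")
```

In a case-control setting, one plausible use of such posteriors is to compute expected genotype counts separately in cases and controls and feed them into an association statistic; how the published method forms its statistic, and how SumStat combines loci, is described in the paper itself rather than in this sketch.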

MeSH terms

Algorithms*
Base Sequence
Case-Control Studies
Gene Frequency
Genetic Predisposition to Disease / genetics*
Genome-Wide Association Study
Genotype
Haplotypes
High-Throughput Nucleotide Sequencing / methods*
Humans
Polymorphism, Single Nucleotide*
Sequence Homology, Nucleic Acid