Comprehensive evaluation of imputation performance in African Americans

Pritam Chanda; Naoya Yuhki; Man Li; Joel S Bader; Alex Hartz; Eric Boerwinkle; W H Linda Kao; Dan E Arking

doi:10.1038/jhg.2012.43

J Hum Genet. 2012 Jul;57(7):411-21. doi: 10.1038/jhg.2012.43. Epub 2012 May 31.

Authors

Pritam Chanda¹, Naoya Yuhki, Man Li, Joel S Bader, Alex Hartz, Eric Boerwinkle, W H Linda Kao, Dan E Arking

Affiliation

¹ Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21205, USA.

Abstract

Imputation of genome-wide single-nucleotide polymorphism (SNP) arrays to a larger known reference panel of SNPs has become a standard and essential part of genome-wide association studies. However, little is known about the behavior of imputation in African Americans with respect to the different imputation algorithms, the reference population(s) and the reference SNP panels used. Genome-wide SNP data (Affymetrix 6.0) from 3207 African American samples in the Atherosclerosis Risk in Communities Study (ARIC) were used to systematically evaluate imputation quality and yield. Imputation was performed with the imputation algorithms MACH, IMPUTE and BEAGLE using several combinations of three HapMap III reference panels (ASW, YRI and CEU) and 1000 Genomes Project panels (pilot 1 YRI June 2010 release, EUR and AFR August 2010 and June 2011 releases), with SNP data on chromosomes 18, 20 and 22. About 10% of the directly genotyped SNPs from each chromosome were masked, and SNPs common between the reference panels were used for evaluating the imputation quality using two statistical metrics: concordance accuracy and Cohen's kappa (κ) coefficient. The dependencies of these metrics on the minor allele frequencies (MAF) and specific genotype categories (minor allele homozygotes, heterozygotes and major allele homozygotes) were thoroughly investigated to determine the best panel and method for imputation in African Americans. In addition, the power to detect imputed SNPs associated with simulated phenotypes was studied using the mean genotype of each masked SNP in the imputed data.

Our results indicate that the genotype concordances after stratification into each genotype category and Cohen's κ coefficient are considerably better equipped to differentiate imputation performance than the traditionally used total concordance statistic, and both statistics improved with increasing MAF irrespective of the imputation method. We also find that MACH and IMPUTE performed equally well and consistently better than BEAGLE irrespective of the reference panel used. Of the various combinations of reference panels, for both HapMap III and 1000 Genomes Project reference panels, the multi-ethnic panels had better imputation accuracy than those containing samples from only a single ethnic group. The most recent 1000 Genomes Project release (June 2011) had a substantially higher number of imputed SNPs than HapMap III and performed as well as or better than the best combined HapMap III reference panels and previous releases of the 1000 Genomes Project.
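
To illustrate why stratified concordance and Cohen's κ can separate imputation performance where total concordance cannot, the following is a minimal sketch for a single masked SNP. It is not the authors' code. It assumes best-guess genotypes coded as minor-allele counts (0 = major allele homozygote, 1 = heterozygote, 2 = minor allele homozygote), and the function names and toy data are hypothetical.

```python
from collections import Counter


def total_concordance(true, imputed):
    """Fraction of samples whose imputed genotype equals the true genotype."""
    matches = sum(t == i for t, i in zip(true, imputed))
    return matches / len(true)


def concordance_by_genotype(true, imputed):
    """Concordance computed separately within each true genotype category."""
    totals = Counter(true)
    hits = Counter(t for t, i in zip(true, imputed) if t == i)
    return {g: hits[g] / totals[g] for g in totals}


def cohens_kappa(true, imputed):
    """Cohen's kappa: agreement corrected for agreement expected by chance."""
    n = len(true)
    p_observed = total_concordance(true, imputed)
    true_freq = Counter(true)
    imputed_freq = Counter(imputed)
    # Chance agreement: product of the marginal genotype frequencies, summed
    # over genotype categories.
    p_expected = sum(true_freq[g] * imputed_freq[g] for g in true_freq) / (n * n)
    if p_expected == 1.0:
        # Both call sets fall in one identical category; kappa is undefined.
        return float("nan")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Toy low-MAF SNP (assumed data): 90 major homozygotes, 8 heterozygotes,
# 2 minor homozygotes. The "imputation" naively calls everyone 0.
true_genotypes = [0] * 90 + [1] * 8 + [2] * 2
naive_imputed = [0] * 100

print(total_concordance(true_genotypes, naive_imputed))
print(concordance_by_genotype(true_genotypes, naive_imputed))
print(cohens_kappa(true_genotypes, naive_imputed))
```

In this toy case the total concordance is high even though no heterozygote or minor allele homozygote is called correctly, while the per-category concordances and κ expose the failure. This matches the abstract's point that these statistics discriminate better than total concordance, particularly at low MAF.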

Publication types

Evaluation Study
Research Support, N.I.H., Extramural

MeSH terms

Algorithms*
Atherosclerosis
Black or African American / genetics*
Chromosomes, Human / genetics
Gene Frequency
Genetic Association Studies / methods*
Genetics, Population / methods
Genome, Human
Genotype
Genotyping Techniques / methods
HapMap Project
Homozygote
Humans
Polymorphism, Single Nucleotide*
Reproducibility of Results
Risk Factors
Software*
