Batch effects in the BRLMM genotype calling algorithm influence GWAS results for the Affymetrix 500K array

K Miclaus; R Wolfinger; S Vega; M Chierici; C Furlanello; C Lambert; H Hong; Li Zhang; S Yin; F Goodsaid

doi:10.1038/tpj.2010.36

Pharmacogenomics J. 2010 Aug;10(4):336-46. doi: 10.1038/tpj.2010.36.

Authors

K Miclaus¹, R Wolfinger, S Vega, M Chierici, C Furlanello, C Lambert, H Hong, Li Zhang, S Yin, F Goodsaid

Affiliation

¹ SAS Institute, Cary, NC 27513, USA. Kelci.Miclaus@sas.com

PMID: 20676071

Abstract

The Affymetrix GeneChip Human Mapping 500K array is commonly used for genome-wide association studies (GWASs). Recent findings highlight the importance of accurate genotype calling algorithms in limiting inflation of Type I and Type II error rates. Differences in results caused by genotype calling errors can introduce severe bias in case-control association studies. We used data from the Wellcome Trust Case Control Consortium on 1991 individuals with coronary artery disease (CAD) and 1500 controls from the UK Blood Services (NBS), all genotyped on the Affymetrix 500K array. Different batch sizes and compositions were used in the Bayesian Robust Linear Model with Mahalanobis distance classifier (BRLMM) genotype calling algorithm to assess the batch effect on downstream association analysis. Results show that batch composition (cases and controls genotyped simultaneously or separately) and batch size (number of individuals processed by BRLMM at a time) can create 2-3% discordance in quality control and statistical analysis results and may contribute to the lack of reproducibility between GWASs. Changes in batch size are largely responsible for differential single-nucleotide polymorphism results, yet we observe evidence of an interaction between batch size and composition that contributes to discordance in the list of significantly associated loci.
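
To make the notion of discordance between two calling configurations concrete, the sketch below compares the sets of SNPs flagged (for example, as significantly associated) under two batch setups. The abstract does not state how discordance was computed, so the definition used here (SNPs flagged under exactly one configuration, divided by SNPs flagged under either) is an assumption, and the function name and SNP identifiers are hypothetical.

```python
def discordance_rate(flagged_a, flagged_b):
    """Fraction of SNPs flagged under exactly one of two configurations.

    Assumed definition (not taken from the paper):
    |A symmetric-difference B| / |A union B|.
    """
    a, b = set(flagged_a), set(flagged_b)
    union = a | b
    if not union:
        # Nothing flagged under either configuration: treat as fully concordant.
        return 0.0
    return len(a ^ b) / len(union)


# Hypothetical SNP identifiers, invented purely for illustration.
joint_batch = {"snp_1", "snp_2", "snp_3", "snp_4"}      # cases and controls called together
separate_batch = {"snp_2", "snp_3", "snp_4", "snp_5"}   # cases and controls called separately

rate = discordance_rate(joint_batch, separate_batch)
print(f"discordance: {rate:.1%}")
```

The same comparison could be applied to lists of SNPs passing quality control, or repeated across batch sizes to see which factor moves the result more.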

MeSH terms

Algorithms*
Case-Control Studies
Coronary Artery Disease / genetics
Databases, Genetic
Genome-Wide Association Study / statistics & numerical data*
Genotype*
Humans
Linear Models
Models, Statistical
Odds Ratio
Oligonucleotide Array Sequence Analysis / standards*
Oligonucleotide Array Sequence Analysis / statistics & numerical data*
Polymorphism, Single Nucleotide
Predictive Value of Tests
Quality Control