Practical issues in screening and variable selection in genome-wide association analysis

Sungyeon Hong; Yongkang Kim; Taesung Park

doi:10.4137/CIN.S16350

Cancer Inform. 2015 Jan 14;13(Suppl 7):55-65. doi: 10.4137/CIN.S16350. eCollection 2014.

Authors

Sungyeon Hong¹, Yongkang Kim¹, Taesung Park²

Affiliations

¹ Department of Statistics, Seoul National University, Seoul, South Korea.
² Department of Statistics, Seoul National University, Seoul, South Korea; Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, South Korea.

Abstract

Variable selection methods play an important role in high-dimensional statistical modeling and analysis. Computational cost and estimation accuracy are the two main concerns in statistical inference from ultrahigh-dimensional data. Genome-wide association studies (GWAS), which aim to identify single nucleotide polymorphisms (SNPs) associated with a disease of interest, produce data of this kind. Numerous methods have been proposed to handle GWAS data, and most adopt a two-stage approach: pre-screening for dimension reduction, followed by variable selection to identify causal SNPs. The pre-screening step ranks SNPs by their P-values or by the absolute values of their regression coefficients in single-SNP analysis. Penalized regressions, such as the ridge, lasso, adaptive lasso, and elastic-net regressions, are commonly used for the variable selection step. In this paper, we investigate which combination of pre-screening method and penalized regression performs best on a quantitative phenotype using two real GWAS datasets.
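The two-stage approach described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes synthetic genotypes coded 0/1/2, screens by the absolute single-SNP regression coefficient (one of the two screening criteria mentioned), and uses the lasso (one of the penalized regressions mentioned) fitted by a plain coordinate descent. All function names, the sample sizes, the effect sizes, and the penalty value `lam` are hypothetical choices made only for illustration.

```python
import random


def marginal_screen(X, y, keep):
    """Stage 1: rank SNPs by |slope| from single-SNP least squares; return top `keep` indices."""
    n = len(y)
    y_mean = sum(y) / n
    scores = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        x_mean = sum(col) / n
        sxx = sum((v - x_mean) ** 2 for v in col)
        sxy = sum((v - x_mean) * (t - y_mean) for v, t in zip(col, y))
        slope = sxy / sxx if sxx > 0 else 0.0  # monomorphic SNPs get score 0
        scores.append((abs(slope), j))
    scores.sort(reverse=True)
    return sorted(j for _, j in scores[:keep])


def soft_threshold(z, t):
    """Soft-thresholding operator used by the lasso coordinate update."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, y, lam, n_iter=200):
    """Stage 2: coordinate descent for (1/(2n))*||y - Xb||^2 + lam*||b||_1 on centered data."""
    n, p = len(y), len(X[0])
    y_mean = sum(y) / n
    resid = [t - y_mean for t in y]  # residual with all coefficients at zero
    cols = []
    for j in range(p):
        col = [row[j] for row in X]
        m = sum(col) / n
        cols.append([v - m for v in col])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            cj = cols[j]
            norm = sum(v * v for v in cj) / n
            if norm == 0:
                continue
            # correlation of SNP j with the partial residual (SNP j's own fit added back)
            rho = sum(v * (r + beta[j] * v) for v, r in zip(cj, resid)) / n
            new = soft_threshold(rho, lam) / norm
            delta = new - beta[j]
            if delta != 0.0:
                resid = [r - delta * v for r, v in zip(resid, cj)]
                beta[j] = new
    return beta


# Synthetic data (assumption): 200 subjects, 50 SNPs, minor allele frequency 0.3.
rng = random.Random(0)
n_subjects, n_snps = 200, 50
X = [[(rng.random() < 0.3) + (rng.random() < 0.3) for _ in range(n_snps)]
     for _ in range(n_subjects)]
# Hypothetical causal SNPs 3 and 17 acting on a quantitative phenotype.
y = [1.5 * row[3] - 1.0 * row[17] + rng.gauss(0, 0.5) for row in X]

selected = marginal_screen(X, y, keep=10)          # stage 1: pre-screening
X_reduced = [[row[j] for j in selected] for row in X]
beta = lasso_cd(X_reduced, y, lam=0.1)             # stage 2: penalized regression
coef = dict(zip(selected, beta))                   # SNP index -> lasso coefficient

print({j: round(b, 3) for j, b in coef.items() if b != 0.0})
```

Swapping the screening score (for example, a P-value instead of the absolute slope) or the penalty (ridge, adaptive lasso, elastic net) changes only one of the two functions, which is what makes comparing combinations of the two stages straightforward.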

Keywords: genome-wide association study; penalized regression; the Age-Related Eye Disease Study (AREDS); the Korea Association Resource (KARE); variable selection.

Publication types

Review