Variable selection method for the identification of epistatic models

Emily Rose Holzinger; Silke Szymczak; Abhijit Dasgupta; James Malley; Qing Li; Joan E Bailey-Wilson

Pac Symp Biocomput. 2015:20:195-206.

Authors

Emily Rose Holzinger¹, Silke Szymczak, Abhijit Dasgupta, James Malley, Qing Li, Joan E Bailey-Wilson

Affiliation

¹ Computational and Statistical Genomics Branch (NHGRI, NIH), Baltimore, MD 21224, USA. emily.holzinger@nih.gov.

PMID: 25592581
PMCID: PMC4299919

Abstract

Standard analysis methods for genome-wide association studies (GWAS) are not robust to complex disease models, such as interactions between variables with small main effects. These types of effects likely contribute to the heritability of complex human traits. Machine learning methods that are capable of identifying interactions, such as Random Forests (RF), are an alternative analysis approach. One caveat of RF is that there is no standardized method of selecting variables so that false positives are reduced while adequate power is retained. To this end, we have developed a novel variable selection method called the relative recurrency variable importance metric (r2VIM). This method incorporates recurrency and variance estimation to assist in optimal threshold selection. For this study, we specifically address how this method performs in data with almost completely epistatic effects (i.e., no marginal effects). Our results show that, with appropriate parameter settings, r2VIM can identify interaction effects when the marginal effects are virtually nonexistent. It also outperforms logistic regression, which has essentially no power under this type of model when the number of potential features (genetic variants) is large. (All Supplementary Data can be found at http://research.nhgri.nih.gov/manuscripts/Bailey-Wilson/r2VIM_epi/.)
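
The abstract names two ingredients of r2VIM, recurrency and variance estimation, but does not spell out the computation. The Python sketch below shows one way such a selection step could look. It rests on assumptions that are not stated in the abstract: that importance scores from several independent RF runs are already available, that the absolute value of the smallest importance score in a run serves as a rough estimate of null variation, and that a variable is kept only if its scaled score meets a threshold in every run. The function name r2vim_select, the parameter factor, and the example scores are hypothetical and for illustration only.

```python
def r2vim_select(vim_runs, factor=3.0):
    """Select variables whose scaled importance meets `factor` in every run.

    vim_runs: list of dicts, one per RF run, mapping variable name to an
              importance score (hypothetical input format).
    factor:   hypothetical threshold on the scaled importance.
    """
    selected = None
    for vims in vim_runs:
        # Assumption: the most negative score approximates null variation.
        scale = abs(min(vims.values()))
        if scale == 0:
            raise ValueError("cannot scale: minimal importance is zero")
        passing = {name for name, score in vims.items()
                   if score / scale >= factor}
        # Recurrency: keep only variables that pass in all runs so far.
        selected = passing if selected is None else selected & passing
    return sorted(selected or [])


# Made-up importance scores from two runs, for illustration only.
runs = [
    {"snp1": 0.9, "snp2": 0.8, "snp3": 0.05, "snp4": -0.1},
    {"snp1": 0.7, "snp2": 0.2, "snp3": -0.2, "snp4": 0.1},
]
print(r2vim_select(runs, factor=3.0))
```

In this made-up input, snp2 clears the threshold in the first run only, so the recurrency requirement drops it. Raising factor or adding runs makes the selection stricter, which is the trade-off between false positives and power that the abstract refers to.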

Publication types

Research Support, N.I.H., Extramural
Research Support, N.I.H., Intramural

MeSH terms

Algorithms
Computational Biology
Computer Simulation
Databases, Genetic
Epistasis, Genetic*
Genome-Wide Association Study / statistics & numerical data
Humans
Linkage Disequilibrium
Logistic Models
Machine Learning
Models, Genetic*
Polymorphism, Single Nucleotide
Signal-To-Noise Ratio

Grants and funding

Z99 GM999999/Intramural NIH HHS/United States