Risk Classification with an Adaptive Naive Bayes Kernel Machine Model

Jessica Minnier; Ming Yuan; Jun S Liu; Tianxi Cai

doi:10.1080/01621459.2014.908778

J Am Stat Assoc. 2015 Apr 22;110(509):393-404. doi: 10.1080/01621459.2014.908778.

Authors

Jessica Minnier¹, Ming Yuan², Jun S Liu³, Tianxi Cai⁴

Affiliations

¹ Assistant Professor, Department of Public Health & Preventive Medicine, Oregon Health & Science University, Portland, OR 97239.
² Professor, Department of Statistics, University of Wisconsin-Madison, Madison, WI 53706.
³ Professor, Department of Statistics, Harvard University, Cambridge, MA 02138.
⁴ Professor, Department of Biostatistics, Harvard School of Public Health, Boston, MA 02115.

Abstract

Genetic studies of complex traits have uncovered only a small number of risk markers, which explain a small fraction of heritability and add little improvement to disease risk prediction. Standard single-marker methods may lack power in selecting informative markers or estimating effects, and most existing methods do not account for non-linearity. Identifying markers with weak signals and estimating their joint effects among many non-informative markers remains challenging. One potential approach is to group markers based on biological knowledge such as gene structure. If markers in a group tend to have similar effects, proper use of the group structure could improve power and efficiency in estimation. We propose a two-stage method that relates markers to disease risk by taking advantage of known gene-set structures. In stage I, imposing a naive Bayes kernel machine (KM) model, we estimate gene-set-specific risk models that relate each gene-set to the outcome. The KM framework efficiently models potentially non-linear effects of predictors without requiring explicit specification of functional forms. In stage II, we aggregate information across gene-sets via a regularization procedure. Estimation and computational efficiency are further improved with kernel principal component analysis. Asymptotic results for model estimation and gene-set selection are derived, and numerical studies suggest that the proposed procedure could outperform existing procedures for constructing genetic risk models.

Keywords: Gene-set analysis; Genetic association; Genetic pathways; Kernel PCA; Kernel machine regression; Principal component analysis; Risk prediction.
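
The two-stage idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the Gaussian kernel, the squared-error ridge fit in stage I (the abstract describes a risk classification model, not a linear fit), the plain lasso in stage II, and all function names and tuning values (`rho`, `var_frac`, `ridge`, `lam`) are assumptions chosen for brevity. The sketch only shows the general flow: per-gene-set kernel PCA, a gene-set-specific fit, then a sparse aggregation across gene-sets.

```python
import numpy as np

def gaussian_kernel(X, rho=None):
    """Gaussian kernel matrix for the rows of X (rho is an assumed bandwidth)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    if rho is None:
        rho = X.shape[1]  # assumed default: number of markers in the set
    return np.exp(-d2 / rho)

def kernel_pcs(K, var_frac=0.9):
    """Leading kernel principal components of the centered kernel matrix."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    vals, vecs = np.linalg.eigh(H @ K @ H)      # ascending eigenvalues
    vals = np.clip(vals[::-1], 0.0, None)       # reorder to descending
    vecs = vecs[:, ::-1]
    total = vals.sum()
    if total <= 0:
        return np.zeros((n, 1))
    # smallest number of components whose eigenvalues reach var_frac of the total
    k = int(np.searchsorted(np.cumsum(vals) / total, var_frac)) + 1
    k = min(k, n)
    return vecs[:, :k] * np.sqrt(vals[:k])

def stage_one_scores(gene_sets, y, ridge=1.0):
    """Stage I (simplified): ridge fit of y on each gene-set's kernel PCs."""
    yc = y - y.mean()
    scores = []
    for X in gene_sets:
        Z = kernel_pcs(gaussian_kernel(X))
        beta = np.linalg.solve(Z.T @ Z + ridge * np.eye(Z.shape[1]), Z.T @ yc)
        scores.append(Z @ beta)                 # fitted gene-set score
    return np.column_stack(scores)

def stage_two_lasso(S, y, lam=0.1, n_iter=200):
    """Stage II (simplified): lasso over gene-set scores, coordinate descent."""
    n, m = S.shape
    yc = y - y.mean()
    denom = np.sum(S ** 2, axis=0) / n
    w = np.zeros(m)
    for _ in range(n_iter):
        for j in range(m):
            if denom[j] <= 0:
                continue
            r = yc - S @ w + S[:, j] * w[j]     # partial residual without set j
            zj = S[:, j] @ r / n
            w[j] = np.sign(zj) * max(abs(zj) - lam, 0.0) / denom[j]
    return w

# Simulated data (hypothetical): three gene-sets, only the first is informative.
rng = np.random.default_rng(0)
n = 200
gene_sets = [rng.normal(size=(n, 2)) for _ in range(3)]
X0 = gene_sets[0]
y = X0[:, 0] ** 2 + np.sin(X0[:, 1]) + 0.5 * rng.normal(size=n)

S = stage_one_scores(gene_sets, y)
w = stage_two_lasso(S, y)
print(w)  # one aggregation weight per gene-set
```

In this toy setting the kernel PCs let the stage I fit capture the non-linear effect of the first gene-set without specifying its functional form, and the stage II penalty can shrink the weights of uninformative gene-sets toward zero. Stage I scores here are in-sample fits, so a real analysis would need care to limit overfitting before aggregation.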
