Variable selection in semi-parametric models

Stat Methods Med Res. 2016 Aug;25(4):1736-52. doi: 10.1177/0962280213499679. Epub 2013 Aug 28.

Abstract

We propose Bayesian variable selection methods for semi-parametric models in the framework of partially linear Gaussian and probit regressions. Reproducing kernels are used to evaluate the possibly non-linear joint effect of a set of variables. Indicator variables are introduced into the reproducing kernels to control the inclusion or exclusion of each variable. Different scenarios, based on the posterior probability of including each variable, are proposed for selecting important variables. Simulations are used to demonstrate and evaluate the methods. The simulations show that the proposed methods efficiently select the correct variables regardless of whether the effects are linear or non-linear in an unknown form. The proposed methods are applied to two real data sets to identify cytosine-phosphate-guanine (CpG) methylation sites associated with maternal smoking, and CpG sites associated with cotinine levels after adjusting for creatinine levels. The selected methylation sites have the potential to advance our understanding of the mechanism underlying the impact of smoking exposure on health outcomes, and consequently to benefit medical research in disease intervention.
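The idea of placing inclusion indicators inside a reproducing kernel can be illustrated with a minimal sketch. The abstract does not give the kernel's exact form, so the parameterization below is an assumption: a Gaussian kernel in which each squared coordinate difference is multiplied by a 0/1 indicator. The function name `indicator_gaussian_kernel`, the bandwidth `rho`, and the indicator vector `delta` are hypothetical names chosen for illustration.

```python
import numpy as np

def indicator_gaussian_kernel(X, delta, rho=1.0):
    """Gaussian kernel matrix with per-variable inclusion indicators.

    Assumed form (not taken from the paper):
        K[i, j] = exp(-sum_k delta[k] * (X[i, k] - X[j, k])**2 / rho)
    A variable with delta[k] == 0 drops out of the kernel entirely.
    """
    X = np.asarray(X, dtype=float)          # shape (n, p)
    delta = np.asarray(delta, dtype=float)  # shape (p,), entries 0 or 1
    # Pairwise coordinate differences, shape (n, n, p), via broadcasting.
    diff = X[:, None, :] - X[None, :, :]
    # Squared distance restricted to the included variables.
    sq_dist = np.sum(delta * diff**2, axis=-1)
    return np.exp(-sq_dist / rho)

# Two observations, two variables; only the first variable is included.
X = np.array([[0.0, 1.0],
              [1.0, 5.0]])
K = indicator_gaussian_kernel(X, delta=[1, 0], rho=1.0)
```

In this sketch the second variable has no influence on `K`, however large its values, because its indicator is zero. In a Bayesian sampler the indicators would be updated along with the other parameters, and the fraction of draws in which an indicator equals one would estimate that variable's posterior inclusion probability.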

Keywords: Bayesian methods; Gaussian kernel; non-linear effects; partially linear regression; probit regression; reproducing kernel; variable selection.

MeSH terms

  • Bayes Theorem*
  • Cotinine / analysis
  • CpG Islands
  • Creatinine / metabolism
  • Cytochrome P-450 CYP1A1 / genetics
  • DNA Methylation*
  • Datasets as Topic
  • Epistasis, Genetic*
  • Female
  • Humans
  • Linear Models*
  • Mothers
  • Normal Distribution*
  • Pregnancy
  • Smoking / genetics*

Substances

  • Creatinine
  • Cytochrome P-450 CYP1A1
  • Cotinine