An ensemble-based likelihood ratio approach for family-based genomic risk prediction

J Zhejiang Univ Sci B. 2018;19(12):935-947. doi: 10.1631/jzus.B1800162.

Abstract

Objective: As one of the most popular designs in genetic research, the family-based design is well recognized for its advantages, such as robustness against population stratification and admixture. With vast amounts of genetic data collected from family-based studies, there is great interest in studying the role of genetic markers from the perspective of risk prediction. This study aims to develop a new statistical approach for family-based risk prediction analysis with improved prediction accuracy compared with existing methods based on family history.

Methods: We propose an ensemble-based likelihood ratio (ELR) approach, Fam-ELR, for family-based genomic risk prediction. Fam-ELR incorporates a clustered receiver operating characteristic (ROC) curve method to account for correlations among family samples, and uses a computationally efficient tree-assembling procedure for variable selection and model building.
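
As a rough illustration of the likelihood-ratio idea behind the approach, the sketch below scores each subject by an empirical likelihood ratio for a single marker and summarizes discrimination with the area under the ROC curve. It is not the Fam-ELR implementation: the function names and toy data are hypothetical, subjects are treated as independent (the clustered ROC adjustment for family correlation is omitted), and no tree-assembling or variable selection is performed.

```python
from collections import Counter


def likelihood_ratios(genotypes, labels):
    """Empirical LR per genotype value: P(g | case) / P(g | control).

    Assumes at least one case and one control are present.
    """
    cases = Counter(g for g, y in zip(genotypes, labels) if y == 1)
    controls = Counter(g for g, y in zip(genotypes, labels) if y == 0)
    n_case, n_ctrl = sum(cases.values()), sum(controls.values())
    lr = {}
    for g in set(genotypes):
        p_case = cases[g] / n_case
        p_ctrl = controls[g] / n_ctrl
        # A genotype never seen in controls gets an infinite ratio.
        lr[g] = float("inf") if p_ctrl == 0 else p_case / p_ctrl
    return lr


def auc(scores, labels):
    """Mann-Whitney estimate of the AUC; tied pairs count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: minor-allele counts (0/1/2) and disease status.
genotypes = [0, 0, 1, 1, 2, 2, 0, 1, 2, 2]
labels = [0, 0, 0, 1, 1, 1, 0, 1, 1, 0]

lr = likelihood_ratios(genotypes, labels)
scores = [lr[g] for g in genotypes]
result = auc(scores, labels)
print(lr, result)
```

On this toy data the in-sample AUC is 0.82. Because the ratios are estimated and evaluated on the same subjects, that figure is optimistic. With related individuals, the variance of the AUC estimate would also need a clustered correction, which this sketch does not attempt.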

Results: In simulations, Fam-ELR is robust across various underlying disease models and pedigree structures, and attains better performance than two existing family-based risk prediction methods. In a real-data application to a family-based genome-wide dataset of conduct disorder, Fam-ELR demonstrates its ability to integrate potential risk predictors and interactions into the model for improved accuracy, especially on a genome-wide level.

Conclusions: Compared with existing approaches, such as the genetic risk-score approach, Fam-ELR can incorporate genetic variants with small or moderate marginal effects, together with their interactions, into an improved risk prediction model. It is therefore a robust and useful approach for high-dimensional family-based risk prediction, especially for complex diseases whose etiology is unknown or poorly understood.

Keywords: Family-based study; Genetic risk prediction; High-dimensional data.

MeSH terms

  • Area Under Curve
  • Computer Simulation
  • Conduct Disorder / genetics*
  • Conduct Disorder / physiopathology
  • Family Health
  • Female
  • Genetic Markers
  • Genetic Predisposition to Disease*
  • Genetic Variation
  • Genome, Human*
  • Genome-Wide Association Study
  • Genomics*
  • Humans
  • Likelihood Functions
  • Male
  • Models, Genetic
  • Odds Ratio
  • Pedigree
  • ROC Curve
  • Reproducibility of Results
  • Risk Factors

Substances

  • Genetic Markers