A comparative analysis of methods for predicting clinical outcomes using high-dimensional genomic datasets

J Am Med Inform Assoc. 2014 Oct;21(e2):e312-9. doi: 10.1136/amiajnl-2013-002358. Epub 2014 Apr 15.

Abstract

Objective: The objective of this investigation is to evaluate binary prediction methods for predicting disease status using high-dimensional genomic data. The central hypothesis is that the Bayesian network (BN)-based method called efficient Bayesian multivariate classifier (EBMC) will do well at this task because EBMC builds on BN-based methods that have performed well at learning epistatic interactions.

Method: We evaluate how well eight methods perform binary prediction using high-dimensional discrete genomic datasets containing epistatic interactions. The methods are as follows: naive Bayes (NB), model averaging NB (MANB), feature selection NB (FSNB), EBMC, logistic regression (LR), support vector machines (SVM), Lasso, and extreme learning machines (ELM). We use a hundred 1000-single nucleotide polymorphism (SNP) simulated datasets, ten 10,000-SNP datasets, six semi-synthetic sets, and two real genome-wide association studies (GWAS) datasets in our evaluation.

Results: In fivefold cross-validation studies, the SVM performed best on the 1000-SNP dataset, while the BN-based methods performed best on the other datasets, with EBMC exhibiting the best overall performance. In-sample testing indicates that LR, SVM, Lasso, ELM, and NB tend to overfit the data.

Discussion: EBMC performed better than NB when there are several strong predictors, whereas NB performed better when there are many weak predictors. Furthermore, for all BN-based methods, prediction capability did not degrade as the dimension increased.

Conclusions: Our results support the hypothesis that EBMC performs well at binary outcome prediction using high-dimensional discrete datasets containing epistatic-like interactions. Future research using more GWAS datasets is needed to further investigate the potential of EBMC.

Keywords: Bayesian Classifier; Bayesian Networks; Epistasis; Genomics; High-Dimensional Data; Prediction.

Publication types

  • Comparative Study
  • Evaluation Study
  • Research Support, N.I.H., Extramural

MeSH terms

  • Bayes Theorem*
  • Databases, Genetic*
  • Epistasis, Genetic
  • Genome-Wide Association Study
  • Genomics*
  • Humans
  • Neural Networks, Computer*
  • Prognosis
  • ROC Curve