Comparison of variable selection methods for high-dimensional survival data with competing events

Comput Biol Med. 2017 Dec 1:91:159-167. doi: 10.1016/j.compbiomed.2017.10.021. Epub 2017 Oct 20.

Abstract

Background: In the era of personalized medicine, it is essential to identify gene signatures for each event type in the context of competing risks in order to improve risk stratification and treatment strategy. Until recently, little attention was paid to the performance of high-dimensional variable selection methods in deriving molecular signatures in this context. In this paper, we investigate the performance of two selection methods developed in the framework of high-dimensional data and competing risks: random survival forest and a boosting approach for fitting proportional subdistribution hazards models.

Methods: Using data from bladder cancer patients (GSE5479) and simulated datasets, the stability and prognostic performance of the two methods were evaluated using a resampling strategy. For each sample, the data set was split into 100 training and validation sets. Molecular signatures were developed in the training sets by the two selection methods and then applied to the corresponding validation sets.
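The resampling strategy can be illustrated with a minimal Python sketch. It is not the authors' implementation: the selector `toy_select` is a hypothetical placeholder (a simple correlation ranking on a toy numeric outcome) standing in for random survival forest or boosting, and the training fraction of 2/3, the signature size `k=5`, and the simulated data are all assumptions made for illustration. Only the number of splits (100) comes from the abstract.

```python
import random
from collections import Counter


def abs_correlation(xs, ys):
    """Absolute Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / (sxx * syy) ** 0.5


def toy_select(X, y, rows, k):
    """Hypothetical placeholder selector (NOT random survival forest or boosting):
    keep the k features most correlated with the outcome on the training rows."""
    n_features = len(X[0])
    ys = [y[i] for i in rows]
    scores = [abs_correlation([X[i][j] for i in rows], ys) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return frozenset(ranked[:k])


def resample_signatures(X, y, n_splits=100, train_fraction=2 / 3, k=5, seed=0):
    """Derive one signature per random training split (train_fraction is assumed)."""
    rng = random.Random(seed)
    n = len(X)
    n_train = int(round(train_fraction * n))
    signatures = []
    for _ in range(n_splits):
        rows = list(range(n))
        rng.shuffle(rows)
        train_rows = rows[:n_train]  # rows[n_train:] would form the validation set
        signatures.append(toy_select(X, y, train_rows, k))
    return signatures


def selection_frequency(signatures):
    """Fraction of signatures in which each feature index appears."""
    counts = Counter(g for sig in signatures for g in sig)
    return {g: c / len(signatures) for g, c in counts.items()}


# Simulated toy data: 60 samples, 200 features, only feature 0 is informative.
data_rng = random.Random(1)
X = [[data_rng.gauss(0, 1) for _ in range(200)] for _ in range(60)]
y = [row[0] + data_rng.gauss(0, 1) for row in X]

signatures = resample_signatures(X, y)
freq = selection_frequency(signatures)
```

The sketch covers only the selection step on the training sets. Evaluating each signature on its validation set would additionally require a prediction model and a performance measure suited to competing risks, which are left out here.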

Results: Random survival forest and the boosting approach showed comparable performance for the prediction of survival data, with few selected genes in common. Nevertheless, the resampling approach identified many different sets of genes, and individual genes occurred in the signatures with very low frequency. Also, the smaller the training sample size, the lower the stability of the signatures.
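One common way to put a number on signature stability is the mean pairwise Jaccard index between the gene sets obtained from different resamples. The abstract reports the frequency of gene occurrence and does not state that this index was used, so the Python sketch below is an assumed alternative, and the gene names are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two gene sets: |intersection| / |union|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mean_pairwise_jaccard(signatures):
    """Average Jaccard index over all pairs of signatures (1.0 = identical sets)."""
    pairs = list(combinations(signatures, 2))
    if not pairs:
        raise ValueError("need at least two signatures")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical signatures from three resamples.
stable = [{"g1", "g2", "g3"}, {"g1", "g2", "g3"}, {"g1", "g2", "g4"}]
unstable = [{"g1", "g2", "g3"}, {"g4", "g5", "g6"}, {"g7", "g8", "g1"}]

stable_score = mean_pairwise_jaccard(stable)
unstable_score = mean_pairwise_jaccard(unstable)
```

Values near 1 indicate that nearly the same genes are selected in every resample, while values near 0 correspond to the kind of instability described in the results.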

Conclusion: Random survival forest and the boosting approach give good predictive performance, but the gene signatures are very unstable. Further work is needed to propose adequate strategies for the analysis of high-dimensional data in the context of competing risks.

Keywords: Boosting; Competing risks; High-dimensional data; Random survival forest; Stability; Variable selection.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms*
  • Databases, Factual
  • Gene Expression Profiling
  • Humans
  • Models, Statistical
  • Precision Medicine / methods
  • Survival Analysis*
  • Urinary Bladder Neoplasms / epidemiology
  • Urinary Bladder Neoplasms / genetics
  • Urinary Bladder Neoplasms / metabolism
  • Urinary Bladder Neoplasms / mortality