A Feature Sampling Strategy for Analysis of High Dimensional Genomic Data

IEEE/ACM Trans Comput Biol Bioinform. 2019 Mar-Apr;16(2):434-441. doi: 10.1109/TCBB.2017.2779492. Epub 2017 Dec 4.

Abstract

With the development of high-throughput technology, it has become feasible and common to profile the activity of tens of thousands of genes simultaneously. Such genomic data typically have a sample size of hundreds or fewer, which is much smaller than the feature size (the number of genes). In addition, genes, particularly those from the same pathway, are often highly correlated. These issues pose a great challenge for selecting meaningful genes from a large number of (correlated) candidates in many genomic studies. Quite a few methods have been proposed to address this challenge. Among them, regularization-based techniques, e.g., the lasso, have become especially appealing because they perform model fitting and variable selection at the same time. However, lasso regression has known limitations. One is that the number of genes selected by the lasso cannot exceed the number of samples. Another is that, if causal genes are highly correlated, the lasso tends to select only one or a few of them. Biologists, however, want to identify them all. To overcome these limitations, we present here a novel, robust, and stable variable selection method. Through simulation studies and a real application to transcriptome data, we demonstrate the superiority of the proposed method in selecting highly correlated causal genes. We also provide some theoretical justification for this feature sampling strategy based on mean and variance analyses.
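The second lasso limitation described above can be illustrated with a toy example. The following is a minimal sketch in plain Python, not the authors' method or code: it fits a lasso by cyclic coordinate descent to two perfectly correlated "genes" that both contribute to the response. The names `gene_a`, `gene_b`, `soft_threshold`, and `lasso_coordinate_descent`, the objective scaling (1/2n), and the penalty value 0.5 are all assumptions chosen for illustration.

```python
def soft_threshold(rho, lam):
    """Soft-thresholding operator used in lasso coordinate descent."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_sweeps=50):
    """Minimise (1/2n)*||y - X b||^2 + lam*||b||_1 by cyclic coordinate descent.

    X is a list of feature columns, each a list of n values (illustrative layout).
    """
    n = len(y)
    p = len(X)
    beta = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            # Partial residual: remove the fit of every feature except j.
            partial = [
                y[i] - sum(X[k][i] * beta[k] for k in range(p) if k != j)
                for i in range(n)
            ]
            rho = sum(X[j][i] * partial[i] for i in range(n)) / n
            scale = sum(v * v for v in X[j]) / n
            beta[j] = soft_threshold(rho, lam) / scale
    return beta


# Two hypothetical genes with identical expression profiles; both are causal.
gene_a = [1.0, -1.0, 1.0, -1.0]
gene_b = list(gene_a)
response = [a + b for a, b in zip(gene_a, gene_b)]

coefs = lasso_coordinate_descent([gene_a, gene_b], response, lam=0.5)
selected = [name for name, b in zip(["gene_a", "gene_b"], coefs) if b != 0.0]
print(coefs, selected)
```

In this toy setting the fit assigns the entire effect to `gene_a` and leaves `gene_b` at exactly zero, even though both genes drive the response equally. Which of the two is kept depends only on the update order, which is the kind of instability among correlated causal genes that the abstract describes.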

MeSH terms

  • Breast Neoplasms / genetics
  • Computer Simulation
  • Databases, Genetic*
  • Female
  • Gene Expression Profiling
  • Genomics / methods*
  • Humans
  • Linear Models
  • Transcriptome / genetics