Statistical approach for selection of biologically informative genes

Gene. 2018 May 20:655:71-83. doi: 10.1016/j.gene.2018.02.044. Epub 2018 Feb 16.

Abstract

Selection of informative genes from high dimensional gene expression data has emerged as an important research area in genomics. Many of the gene selection techniques proposed so far are based on either a relevancy or a redundancy measure. Further, the performance of these techniques has been judged by post-selection classification accuracy, computed with a classifier that uses the selected genes. This performance metric may be statistically sound but may not be biologically relevant. A statistical approach, i.e. Boot-MRMR, was proposed based on a composite measure of maximum relevance and minimum redundancy, which is both statistically sound and biologically relevant for informative gene selection. For comparative evaluation of the proposed approach, we developed two biologically sufficient criteria, i.e. Gene Set Enrichment with QTL (GSEQ) and a biological similarity score based on Gene Ontology (GO). Further, a systematic and rigorous evaluation of the proposed technique against 12 existing gene selection techniques was carried out using five gene expression datasets. This evaluation was based on a broad spectrum of statistically sound (e.g. subject classification) and biologically relevant (based on QTL and GO) criteria under a multiple criteria decision-making framework. The performance analysis showed that the proposed technique selects informative genes that are more biologically relevant. The proposed technique was also found to be quite competitive with the existing techniques with respect to subject classification and computational time. Our results also showed that, under the multiple criteria decision-making setup, the proposed technique is the best of the available alternatives for informative gene selection. Based on the proposed approach, an R package, i.e. BootMRMR, has been developed and is available at https://cran.r-project.org/web/packages/BootMRMR. This study provides a practical guide for choosing statistical techniques to select informative genes from high dimensional expression data for breeding and systems biology studies.
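
The general idea of combining a maximum relevance and minimum redundancy score with bootstrap resampling of subjects can be illustrated with a minimal sketch. This is not the algorithm implemented in the BootMRMR package, and the abstract does not specify the exact statistics used. The sketch assumes absolute Pearson correlation for both relevance (gene versus class label) and redundancy (gene versus already selected genes), a difference-form score, and selection frequency across bootstrap samples as the final ranking. The function names `pearson`, `mrmr_select` and `bootstrap_frequency` are hypothetical.

```python
import random
from statistics import mean


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either vector is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def mrmr_select(expr, labels, k):
    """Greedy MRMR over expr (gene name -> expression values per subject).

    Assumed score: |corr(gene, label)| minus the mean |corr(gene, s)|
    over the genes s already selected.
    """
    relevance = {g: abs(pearson(v, labels)) for g, v in expr.items()}
    selected = []
    while len(selected) < min(k, len(expr)):
        best_gene, best_score = None, float("-inf")
        for g in expr:
            if g in selected:
                continue
            redundancy = (
                mean(abs(pearson(expr[g], expr[s])) for s in selected)
                if selected
                else 0.0
            )
            score = relevance[g] - redundancy
            if score > best_score:
                best_gene, best_score = g, score
        selected.append(best_gene)
    return selected


def bootstrap_frequency(expr, labels, k, n_boot=200, seed=0):
    """Fraction of bootstrap samples (subjects resampled with replacement)
    in which each gene is among the k genes chosen by mrmr_select."""
    rng = random.Random(seed)
    n = len(labels)
    counts = {g: 0 for g in expr}
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        boot_labels = [labels[i] for i in idx]
        boot_expr = {g: [v[i] for i in idx] for g, v in expr.items()}
        for g in mrmr_select(boot_expr, boot_labels, k):
            counts[g] += 1
    return {g: c / n_boot for g, c in counts.items()}


# Toy data: g2 duplicates g1, so it is relevant but fully redundant.
labels = [0, 0, 0, 1, 1, 1]
expr = {
    "g1": [1, 2, 3, 4, 5, 6],
    "g2": [1, 2, 3, 4, 5, 6],
    "g3": [2, 1, 0, 3, 2, 1],
}
top_genes = mrmr_select(expr, labels, k=2)
frequencies = bootstrap_frequency(expr, labels, k=2, n_boot=50)
```

In this toy example the duplicated gene is penalised by the redundancy term, so a less relevant but complementary gene is preferred as the second pick. Genes with a high bootstrap selection frequency are the ones whose selection is stable under resampling of subjects.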

Keywords: Boot-MRMR; Bootstrap; Gene Set Enrichment with QTLs; Gene sampling; Informative genes; Subject sampling.

MeSH terms

  • Algorithms*
  • Computational Biology / methods
  • Data Interpretation, Statistical*
  • Gene Expression Profiling / methods
  • Gene Expression Profiling / statistics & numerical data*
  • Gene Ontology
  • Genes* / physiology
  • Genome, Human
  • Genomics / methods
  • Genomics / statistics & numerical data
  • Humans
  • Oligonucleotide Array Sequence Analysis / methods
  • Oligonucleotide Array Sequence Analysis / statistics & numerical data*
  • Sample Size