Informative gene selection and the direct classification of tumors based on relative simplicity

BMC Bioinformatics. 2016 Jan 20:17:44. doi: 10.1186/s12859-016-0893-0.

Abstract

Background: Selecting a parsimonious set of informative genes to build highly generalized performance classifier is the most important task for the analysis of tumor microarray expression data. Many existing gene pair evaluation methods cannot highlight diverse patterns of gene pairs only used one strategy of vertical comparison and horizontal comparison, while individual-gene-ranking method ignores redundancy and synergy among genes.

Results: Here we proposed a novel score measure named relative simplicity (RS). We evaluated gene pairs according to integrating vertical comparison with horizontal comparison, finally built RS-based direct classifier (RS-based DC) based on a set of informative genes capable of binary discrimination with a paired votes strategy. Nine multi-class gene expression datasets involving human cancers were used to validate the performance of new method. Compared with the nine reference models, RS-based DC received the highest average independent test accuracy (91.40%), the best generalization performance and the smallest informative average gene number (20.56). Compared with the four reference feature selection methods, RS also received the highest average test accuracy in three classifiers (Naïve Bayes, k-Nearest Neighbor and Support Vector Machine), and only RS can improve the performance of SVM.

Conclusions: Diverse patterns of gene pairs could be highlighted more fully while integrating vertical comparison with horizontal comparison strategy. DC core classifier can effectively control over-fitting. RS-based feature selection method combined with DC classifier can lead to more robust selection of informative genes and classification accuracy.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Bayes Theorem
  • Databases, Genetic
  • Genes, Neoplasm*
  • Humans
  • Models, Molecular
  • Neoplasms / classification*
  • Neoplasms / diagnosis
  • Neoplasms / genetics*
  • Support Vector Machine