Optimality driven nearest centroid classification from genomic data

PLoS One. 2007 Oct 3;2(10):e1002. doi: 10.1371/journal.pone.0001002.

Abstract

Nearest-centroid classifiers have recently been successfully employed in high-dimensional applications, such as genomics. A necessary step when building a classifier for high-dimensional data is feature selection. Feature selection is frequently carried out by computing univariate scores for each feature individually, without consideration for how a subset of features performs as a whole. We introduce a new feature selection approach for high-dimensional nearest-centroid classifiers that is instead based on the theoretically optimal choice of a given number of features, which we determine directly here. This allows us to develop a new greedy algorithm to estimate this optimal nearest-centroid classifier with a given number of features. In addition, whereas the centroids are usually formed from maximum likelihood estimates, we investigate the applicability of high-dimensional shrinkage estimates of centroids. We apply the proposed method to clinical classification based on gene-expression microarrays, demonstrating that the proposed method can outperform existing nearest-centroid classifiers.
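As a rough illustration of the general idea, the sketch below pairs a plain nearest-centroid classifier with greedy forward feature selection that scores each candidate subset as a whole rather than each feature alone. It is not the authors' algorithm: the abstract does not state their optimality criterion, so training error is used here purely as a stand-in, centroids are ordinary sample means (no shrinkage), and all function names and the toy data are hypothetical.

```python
from statistics import mean

def centroids(X, y, features):
    """Per-class sample-mean vector, restricted to the chosen feature indices."""
    out = {}
    for label in sorted(set(y)):
        rows = [x for x, lab in zip(X, y) if lab == label]
        out[label] = [mean(r[j] for r in rows) for j in features]
    return out

def predict(x, cents, features):
    """Assign x to the class whose centroid is nearest (squared Euclidean distance)."""
    def dist(c):
        return sum((x[j] - cj) ** 2 for j, cj in zip(features, c))
    return min(cents, key=lambda label: dist(cents[label]))

def training_error(X, y, features):
    """Fraction of training samples misclassified using only `features`.
    Assumption: a stand-in criterion, not the criterion derived in the paper."""
    cents = centroids(X, y, features)
    wrong = sum(predict(x, cents, features) != lab for x, lab in zip(X, y))
    return wrong / len(y)

def greedy_select(X, y, k):
    """Forward selection: at each step, add the feature that gives the
    lowest error for the subset as a whole."""
    chosen = []
    remaining = list(range(len(X[0])))
    for _ in range(k):
        best = min(remaining, key=lambda j: training_error(X, y, chosen + [j]))
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Hypothetical toy data: 4 samples, 3 features, 2 classes.
X = [[0.0, 5.0, 1.0],
     [0.2, 1.0, 3.0],
     [1.0, 4.0, 2.0],
     [1.2, 0.0, 4.0]]
y = ["A", "A", "B", "B"]

selected = greedy_select(X, y, 1)
```

In this toy example only the first feature separates the two classes, so it is the one the greedy step picks. With real expression data the subset score would normally be estimated on held-out samples or by cross-validation rather than on the training set.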

Publication types

  • Research Support, N.I.H., Extramural

MeSH terms

  • Algorithms
  • Child
  • Data Interpretation, Statistical*
  • Discriminant Analysis
  • Gene Expression Profiling
  • Gene Expression Regulation, Neoplastic*
  • Genetic Techniques*
  • Genomics*
  • Humans
  • Leukemia
  • Lymphoma / genetics
  • Models, Statistical
  • Models, Theoretical
  • Oligonucleotide Array Sequence Analysis
  • Pattern Recognition, Automated