Robust complementary hierarchical clustering for gene expression data analysis by β-divergence

J Biosci Bioeng. 2013 Sep;116(3):397-407. doi: 10.1016/j.jbiosc.2013.03.010. Epub 2013 Apr 19.

Abstract

Hierarchical clustering (HC) is one of the most widely used unsupervised statistical techniques for analyzing microarray gene expression data. When HC is applied to gene expression data to cluster individuals, most HC algorithms form clusters based on highly differentially expressed (DE) genes that have very similar expression patterns. These highly DE genes may sometimes be irrelevant to the biological processes of interest. The serious problem is that such irrelevant, highly expressed genes can drown out genes with low expression that have important biological functions. To overcome this problem, Nowak and Tibshirani proposed complementary hierarchical clustering (CHC) (Biostatistics, 9, 467-483, 2008). However, CHC is not robust against outlying expression values and often produces misleading results when the gene expression data are contaminated. We therefore propose the robust CHC (RCHC) method, which robustifies CHC with respect to outliers by maximizing the β-likelihood function for sequential extraction of a gene set with proper groups of individuals. The proposed method reduces to CHC as the tuning parameter β → 0. The value of β plays a key role in the performance of the RCHC method because it controls the tradeoff between the robustness and efficiency of the estimators. In simulations and in the analysis of real gene expression data, the RCHC method is robust to data contamination in gene expression clustering, overcomes the problem of CHC, and predicts critically important genes from breast cancer data.
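
The role of β as a robustness tuning parameter can be illustrated with a minimal Python sketch. It is not the RCHC algorithm itself. It only shows the kind of β-weighting that arises when a single location parameter is estimated under a Gaussian model, assuming the scale is known. The function name `beta_weighted_mean`, the `scale` argument, the starting value, and the toy expression values are all hypothetical choices made for illustration.

```python
import math


def beta_weighted_mean(xs, beta, scale, n_iter=100, tol=1e-10):
    """Fixed-point estimate of a location parameter using beta-weights.

    Assumes a Gaussian model with a known scale (an illustrative
    simplification). With beta = 0 every weight equals 1, so the
    estimate is the ordinary sample mean.
    """
    # Start from the median so the iteration begins near the bulk of the data.
    s = sorted(xs)
    n = len(s)
    mu = s[n // 2] if n % 2 else 0.5 * (s[n // 2 - 1] + s[n // 2])

    for _ in range(n_iter):
        # Observations far from the current estimate get exponentially small weight.
        weights = [math.exp(-0.5 * beta * ((x - mu) / scale) ** 2) for x in xs]
        new_mu = sum(w * x for w, x in zip(weights, xs)) / sum(weights)
        converged = abs(new_mu - mu) < tol
        mu = new_mu
        if converged:
            break
    return mu


# Toy expression values for one gene, with a single outlying measurement.
expr = [1.0, 1.2, 0.9, 1.1, 0.8, 1.0, 25.0]

plain = beta_weighted_mean(expr, beta=0.0, scale=1.0)   # ordinary mean, pulled by the outlier
robust = beta_weighted_mean(expr, beta=0.5, scale=1.0)  # outlier is strongly down-weighted

print(plain, robust)
```

In this sketch, β = 0 recovers the non-robust estimate, while a larger β discounts the outlying value at some cost in efficiency on clean data, which mirrors the tradeoff described in the abstract.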

Keywords: DNA microarray; Gene expression; Maximum β-likelihood; Relative gene importance; Robust complementary hierarchical clustering (RCHC); Robustness; Selection procedure of β.

MeSH terms

  • Algorithms
  • Breast Neoplasms / genetics
  • Cluster Analysis
  • Female
  • Gene Expression Profiling / methods*
  • Gene Expression Regulation, Neoplastic
  • Humans
  • Models, Genetic
  • Oligonucleotide Array Sequence Analysis / methods*