An algorithm to infer similarity among cell types and organisms by examining the most expressed sequences

Genet Mol Res. 2008 Sep 30;7(3):933-47. doi: 10.4238/vol7-3x-meeting010.

Abstract

After sequence alignment, clustering algorithms are among the most widely used techniques in gene expression data analysis. Clustering gene expression patterns allows researchers to determine which patterns are alike and therefore which genes are most likely to participate in the same biological process under investigation. Gene expression data also allow whole samples to be clustered, which makes it possible to find which samples are similar and, consequently, which sampled biological conditions are alike. Here, a novel similarity measure and the resulting rank-based clustering algorithm are presented. The clustering was applied to 418 gene expression samples from 13 data series spanning three model organisms: Homo sapiens, Mus musculus, and Arabidopsis thaliana. The initial results are striking: more than 91% of the samples were clustered as expected. The most expressed sequences (MESs) approach outperformed some of the clustering algorithms most commonly applied to this kind of data, such as hierarchical clustering and K-means. The clustering performance suggests that the new similarity measure is an alternative to the traditional correlation/distance measures typically used in clustering algorithms.
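
The abstract does not define the similarity measure itself, so the following is only an illustrative sketch of what a rank-based comparison of most expressed sequences could look like, not the authors' method. It assumes each sample is a mapping from sequence ID to expression level, keeps the k highest-ranked sequences per sample, and scores two samples by the Jaccard overlap of those sets. The function names `top_k_sequences` and `mes_overlap`, the parameter `k`, and the toy values are all hypothetical.

```python
from typing import Dict, Set


def top_k_sequences(expression: Dict[str, float], k: int) -> Set[str]:
    """Return the IDs of the k most expressed sequences in one sample.

    Ties are broken by sequence ID so the result is deterministic.
    """
    ranked = sorted(expression, key=lambda seq_id: (-expression[seq_id], seq_id))
    return set(ranked[:k])


def mes_overlap(sample_a: Dict[str, float], sample_b: Dict[str, float], k: int) -> float:
    """Jaccard overlap of the top-k sequence sets of two samples.

    ASSUMPTION: this overlap score stands in for the paper's similarity
    measure, which the abstract does not specify.
    """
    top_a = top_k_sequences(sample_a, k)
    top_b = top_k_sequences(sample_b, k)
    union = top_a | top_b
    return len(top_a & top_b) / len(union) if union else 0.0


# Made-up toy samples (not data from the study).
sample_x = {"s1": 9.0, "s2": 8.0, "s3": 1.0, "s4": 0.5}
sample_y = {"s1": 7.5, "s2": 9.5, "s3": 0.2, "s4": 1.0}
sample_z = {"s1": 0.3, "s2": 1.0, "s3": 8.0, "s4": 9.0}

print(mes_overlap(sample_x, sample_y, k=2))  # same top-2 set
print(mes_overlap(sample_x, sample_z, k=2))  # disjoint top-2 sets
```

A score of this kind depends only on which sequences rank highest, not on their absolute expression values, which is one reason a rank-based measure can be attractive when samples come from different data series. A pairwise matrix of such scores could then be passed to any clustering procedure that accepts precomputed similarities.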

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms*
  • Animals
  • Arabidopsis / cytology
  • Arabidopsis / genetics
  • Cluster Analysis*
  • Gene Expression Profiling / methods
  • Gene Expression Profiling / statistics & numerical data*
  • Humans
  • Mice
  • Species Specificity