An algorithm to infer similarity among cell types and organisms by examining the most expressed sequences

S A P Pinto; J M Ortega

doi:10.4238/vol7-3x-meeting010

Genet Mol Res. 2008 Sep 30;7(3):933-47. doi: 10.4238/vol7-3x-meeting010.

Authors

S A P Pinto¹, J M Ortega

Affiliation

¹ Departamento de Bioquímica e Imunologia, Instituto de Informática/Barreiro, Pontifícia Universidade Católica de Minas Gerais, Instituto de Ciências Biológicas, Universidade Federal de Minas Gerais, Belo Horizonte, MG, Brasil.

PMID: 18949711
DOI: 10.4238/vol7-3x-meeting010

Abstract

After sequence alignment, clustering algorithms are among the most widely used techniques in gene expression data analysis. Clustering gene expression patterns allows researchers to determine which patterns are alike and, therefore, which genes are most likely to participate in the same biological process under investigation. Gene expression data also allow whole samples to be clustered, which makes it possible to find which samples are similar and, consequently, which sampled biological conditions are alike. Here, a novel similarity measure and the resulting rank-based clustering algorithm are presented. The clustering was applied to 418 gene expression samples from 13 data series spanning three model organisms: Homo sapiens, Mus musculus, and Arabidopsis thaliana. The initial results are striking: more than 91% of the samples were clustered as expected. The most expressed sequences (MESs) approach outperformed some of the most widely used clustering algorithms applied to this kind of data, such as hierarchical clustering and K-means. This clustering performance suggests that the new similarity measure is an alternative to the traditional correlation/distance measures typically used in clustering algorithms.
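
The abstract does not give the formula for the similarity measure, so the following is only a rough sketch of the general idea of comparing samples by their most expressed sequences. It assumes, purely for illustration, that similarity is the Jaccard overlap between the top-k most expressed sequences of two samples; the function names, the cutoff k, and the toy expression values are all hypothetical and are not taken from the paper.

```python
def top_k_sequences(sample, k):
    """Return the set of the k most expressed sequence identifiers.

    `sample` maps a sequence/gene identifier to an expression value.
    Ties are broken alphabetically so the result is deterministic.
    """
    ranked = sorted(sample, key=lambda seq: (-sample[seq], seq))
    return set(ranked[:k])


def top_k_overlap(sample_a, sample_b, k):
    """Jaccard overlap of the two samples' top-k sets.

    Illustrative stand-in only: the paper's actual rank-based
    measure is not specified in the abstract.
    """
    top_a = top_k_sequences(sample_a, k)
    top_b = top_k_sequences(sample_b, k)
    union = top_a | top_b
    return len(top_a & top_b) / len(union) if union else 0.0


# Toy, made-up expression values for three hypothetical samples.
samples = {
    "liver_1": {"ALB": 900, "APOA1": 700, "ACTB": 500, "MBP": 5},
    "liver_2": {"ALB": 850, "APOA1": 650, "ACTB": 600, "MBP": 10},
    "brain_1": {"MBP": 800, "GFAP": 700, "ACTB": 550, "ALB": 3},
}

liver_vs_liver = top_k_overlap(samples["liver_1"], samples["liver_2"], k=2)
liver_vs_brain = top_k_overlap(samples["liver_1"], samples["brain_1"], k=2)
```

Under this assumed measure, the two liver-like toy samples share their top-ranked sequences while the brain-like sample does not, which is the kind of sample-level grouping the abstract describes. A measure of this type depends only on which sequences rank highest, not on absolute expression values, which is one plausible reason a rank-based approach could compare samples across data series.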

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms*
Animals
Arabidopsis / cytology
Arabidopsis / genetics
Cluster Analysis*
Gene Expression Profiling / methods
Gene Expression Profiling / statistics & numerical data*
Humans
Mice
Species Specificity