TCLUST: a fast method for clustering genome-scale expression data

Banu Dost; Chunlei Wu; Andrew Su; Vineet Bafna

doi:10.1109/TCBB.2010.34

IEEE/ACM Trans Comput Biol Bioinform. 2011 May-Jun;8(3):808-18. doi: 10.1109/TCBB.2010.34.

Authors

Banu Dost¹, Chunlei Wu, Andrew Su, Vineet Bafna

Affiliation

¹ Department of Computer Science and Engineering, University of California, San Diego, CA 92093, USA. bdost@cs.ucsd.edu

PMID: 20479508

Abstract

Genes with a common function are often hypothesized to have correlated expression levels in mRNA expression data, motivating the development of clustering algorithms for gene expression data sets. We observe that existing approaches do not scale well for large data sets, and indeed did not converge for the data set considered here. We present a novel clustering method, TCLUST, that exploits coconnectedness to efficiently cluster large, sparse expression data. We compare our approach with two existing clustering methods, CAST and K-means, which have previously been applied to gene expression data with good performance. Using a number of metrics, TCLUST is shown to be superior to, or at least competitive with, the other methods while being much faster. We have applied this clustering algorithm to a genome-scale gene expression data set and used gene set enrichment analysis to discover highly significant biological clusters. (Source code for TCLUST is downloadable at http://www.cse.ucsd.edu/~bdost/tclust.)
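
The premise in the abstract's first sentence, that functionally related genes show correlated expression, can be illustrated with a small sketch. This is not the TCLUST algorithm, whose details the abstract does not give. It only shows one common way to turn expression profiles into a sparse similarity graph of the kind a clustering method might take as input. The function names `pearson` and `similarity_edges`, the threshold value, and the toy profiles are all assumptions made for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        # A constant profile has no defined correlation; treat it as 0.
        return 0.0
    return cov / (norm_x * norm_y)


def similarity_edges(profiles, threshold):
    """Return gene pairs whose correlation meets the threshold.

    Keeping only strongly correlated pairs yields a sparse graph.
    The threshold is a hypothetical tuning parameter.
    """
    genes = sorted(profiles)
    edges = []
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            if pearson(profiles[g], profiles[h]) >= threshold:
                edges.append((g, h))
    return edges


# Toy data (made up): geneA and geneB rise together, geneC falls.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}
edges = similarity_edges(profiles, 0.9)
```

With these toy profiles only the geneA and geneB pair survives the threshold. The all-pairs loop is quadratic in the number of genes, which is one reason scaling is a concern for genome-scale data.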

Publication types

Research Support, Non-U.S. Gov't
Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

Algorithms
Animals
Cluster Analysis*
Computer Simulation
Databases, Genetic*
Gene Expression Profiling / methods*
Genomics / methods*
Mice
Mice, Inbred Strains
Models, Molecular
Oligonucleotide Array Sequence Analysis*