Representative distance: a new similarity measure for class discovery from gene expression data

Zhiwen Yu; Jane You; Le Li; Hau-San Wong; Guoqiang Han

doi:10.1109/TNB.2012.2208198

IEEE Trans Nanobioscience. 2012 Dec;11(4):341-51. doi: 10.1109/TNB.2012.2208198. Epub 2012 Aug 6.

Authors

Zhiwen Yu¹, Jane You, Le Li, Hau-San Wong, Guoqiang Han

Affiliation

¹ School of Computer Science and Engineering, South China University of Technology, Guangzhou, China. zhwyu@scut.edu.cn

PMID: 22893451
DOI: 10.1109/TNB.2012.2208198

Abstract

Similarity measurement is one of the most important stages in the process of cancer discovery from gene expression data. Traditional distance functions, such as the Euclidean distance, the correlation coefficient, and the cosine distance, are typically used to quantify the similarity between two cancer samples. However, these measures take into account neither the properties of cancer samples nor the relationships among the genes in gene expression data. To exploit both, we design a new similarity measure called representative distance (RD) to identify cancer samples in gene expression data. Specifically, RD does not compute the distance between two cancer samples using all the genes; it calculates the similarity using only the representative genes selected by the affinity propagation algorithm. A similarity matrix is then constructed based on the representative distance. Finally, the spectral clustering algorithm is adopted to partition the similarity matrix and discover the biologically meaningful samples. To our knowledge, this is the first time that the representative distance has been applied to class discovery from gene expression data. Experiments on real cancer datasets indicate that our similarity measure can i) outperform most of the traditional distance measures and ii) identify cancer samples correctly in most of the datasets.
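
The three stages named in the abstract (representative gene selection by affinity propagation, a similarity matrix over samples, spectral clustering) can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the exact RD formula, so the sketch assumes a squared Euclidean distance restricted to the representative genes and a Gaussian kernel to turn distances into similarities. The function name `rd_class_discovery` and its parameters are hypothetical; the clustering calls are from scikit-learn.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation, SpectralClustering


def rd_class_discovery(X, n_classes, random_state=0):
    """Sketch of an RD-style pipeline. X is a (samples x genes) matrix."""
    X = np.asarray(X, dtype=float)

    # Stage 1: cluster the genes (columns of X); each gene is described
    # by its expression profile across samples. The exemplars found by
    # affinity propagation serve as the representative genes.
    ap = AffinityPropagation(random_state=random_state).fit(X.T)
    rep_genes = np.asarray(ap.cluster_centers_indices_, dtype=int)
    if rep_genes.size == 0:
        # Assumption: if affinity propagation fails to converge,
        # fall back to using every gene.
        rep_genes = np.arange(X.shape[1])

    # Stage 2: pairwise squared Euclidean distance between samples,
    # computed on the representative genes only (assumed form of RD).
    R = X[:, rep_genes]
    diff = R[:, None, :] - R[None, :, :]
    d2 = (diff ** 2).sum(axis=-1)

    # Assumed Gaussian kernel; the scale is the median nonzero distance.
    positive = d2[d2 > 0]
    scale = np.median(positive) if positive.size else 1.0
    S = np.exp(-d2 / scale)

    # Stage 3: spectral clustering on the precomputed similarity matrix.
    sc = SpectralClustering(
        n_clusters=n_classes,
        affinity="precomputed",
        random_state=random_state,
    )
    labels = sc.fit_predict(S)
    return rep_genes, S, labels
```

Calling `rd_class_discovery(X, n_classes=2)` on a samples-by-genes matrix returns the indices of the representative genes, the sample similarity matrix, and one cluster label per sample. The number of classes is passed in explicitly here; the abstract does not say how it is chosen.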

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms
Cluster Analysis
Gene Expression Profiling
Gene Expression Regulation, Neoplastic*
Neoplasms / genetics*
Oligonucleotide Array Sequence Analysis