Binary analysis and optimization-based normalization of gene expression data

Ilya Shmulevich; Wei Zhang

doi:10.1093/bioinformatics/18.4.555

Bioinformatics. 2002 Apr;18(4):555-65. doi: 10.1093/bioinformatics/18.4.555.

Authors

Ilya Shmulevich¹, Wei Zhang

Affiliation

¹ Cancer Genomics Laboratory, Department of Pathology, University of Texas MD Anderson Cancer Center, 1515 Holcombe Blvd, Box 85, Houston, TX 77030, USA. is@ieee.org

PMID: 12016053
DOI: 10.1093/bioinformatics/18.4.555

Abstract

Motivation: Most approaches to gene expression analysis use real-valued expression data, produced by high-throughput screening technologies, such as microarrays. Often, some measure of similarity must be computed in order to extract meaningful information from the observed data. The choice of this similarity measure frequently has a profound effect on the results of the analysis, yet no standards exist to guide the researcher.

Results: To address this issue, we propose to analyse gene expression data entirely in the binary domain. The natural measure of similarity then becomes the Hamming distance, which reflects the notion of similarity used by biologists. We also develop a novel data-dependent, optimization-based method, based on Genetic Algorithms (GAs), for normalizing gene expression data. Normalization is a necessary step before quantizing gene expression data into the binary domain and, more generally, for comparing data between different arrays. We then present an algorithm for binarizing gene expression data and illustrate the use of the above methods on two different sets of data. Using Multidimensional Scaling, we show that a reasonable degree of separation between different tumor types in each data set can be achieved by working solely in the binary domain. The binary approach offers several advantages, such as noise resilience and computational efficiency, making it a viable approach to extracting meaningful biological information from gene expression data.
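
To make the idea of a binary-domain similarity measure concrete, the following is a minimal sketch of computing Hamming distances between samples once expression values have been quantized to 0/1. The abstract does not describe the authors' binarization algorithm or their GA-based normalization, so the per-gene median threshold used here is only a stand-in assumption, and the function names `binarize`, `hamming`, and `sample_distances` are hypothetical.

```python
from statistics import median


def binarize(expr):
    """Quantize a genes-by-samples table to 0/1.

    ASSUMPTION: each gene is thresholded at its own median across samples.
    This is a placeholder, not the binarization algorithm of the paper.
    """
    binary = []
    for row in expr:
        threshold = median(row)
        binary.append([1 if value > threshold else 0 for value in row])
    return binary


def hamming(u, v):
    """Number of positions at which two equal-length binary vectors differ."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a != b for a, b in zip(u, v))


def sample_distances(binary):
    """Pairwise Hamming distances between samples (columns of the table)."""
    samples = list(zip(*binary))  # transpose: one binary profile per sample
    return [[hamming(a, b) for b in samples] for a in samples]


if __name__ == "__main__":
    # Toy data, invented for illustration: 3 genes (rows) x 4 samples (columns).
    expr = [
        [1.0, 5.0, 2.0, 6.0],
        [3.0, 1.0, 4.0, 0.0],
        [2.0, 2.0, 9.0, 9.0],
    ]
    for row in sample_distances(binarize(expr)):
        print(row)
```

A distance matrix of this kind is the sort of input that a Multidimensional Scaling routine could take to produce a low-dimensional view of the samples. Whether the authors computed it this way is not stated in the abstract.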

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms*
Cluster Analysis*
Gene Expression Profiling / methods*
Glioma / classification
Glioma / genetics
Humans
Leiomyosarcoma / classification
Leiomyosarcoma / genetics
Models, Genetic*
Models, Statistical*
Oligonucleotide Array Sequence Analysis / methods*
Pattern Recognition, Automated
Reproducibility of Results
Sensitivity and Specificity
Statistics as Topic
Statistics, Nonparametric