Binary analysis and optimization-based normalization of gene expression data

Bioinformatics. 2002 Apr;18(4):555-65. doi: 10.1093/bioinformatics/18.4.555.

Abstract

Motivation: Most approaches to gene expression analysis use real-valued expression data, produced by high-throughput screening technologies, such as microarrays. Often, some measure of similarity must be computed in order to extract meaningful information from the observed data. The choice of this similarity measure frequently has a profound effect on the results of the analysis, yet no standards exist to guide the researcher.

Results: To address this issue, we propose to analyse gene expression data entirely in the binary domain. The natural measure of similarity becomes the Hamming distance and reflects the notion of similarity used by biologists. We also develop a novel data-dependent optimization-based method, based on Genetic Algorithms (GAs), for normalizing gene expression data. This is a necessary step before quantizing gene expression data into the binary domain and, more generally, for comparing data between different arrays. We then present an algorithm for binarizing gene expression data and illustrate the use of the above methods on two different sets of data. Using Multidimensional Scaling, we show that a reasonable degree of separation between different tumor types in each data set can be achieved by working solely in the binary domain. The binary approach offers several advantages, such as noise resilience and computational efficiency, making it a viable approach to extracting meaningful biological information from gene expression data.
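
To make the binary-domain idea concrete, the following minimal Python sketch binarizes a few expression profiles and compares them by Hamming distance. It is an illustration only: the abstract does not describe the paper's binarization algorithm or its GA-based normalization, so a simple per-sample median threshold is assumed here as a stand-in, and the function names (binarize, hamming, distance_matrix) and the sample values are hypothetical.

```python
from statistics import median


def binarize(profile, threshold=None):
    """Map a real-valued expression profile to a list of 0/1 values.

    Assumption: a per-sample median cut-off stands in for the paper's
    binarization algorithm, which the abstract does not specify.
    """
    cut = median(profile) if threshold is None else threshold
    return [1 if value > cut else 0 for value in profile]


def hamming(a, b):
    """Count the positions at which two binary profiles differ."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same length")
    return sum(x != y for x, y in zip(a, b))


def distance_matrix(samples):
    """Pairwise Hamming distances between binarized samples."""
    binary = [binarize(sample) for sample in samples]
    return [[hamming(p, q) for q in binary] for p in binary]


# Made-up expression values for three samples over six genes.
samples = [
    [0.2, 1.5, 3.1, 0.1, 2.4, 0.9],
    [0.3, 1.7, 2.9, 0.2, 2.2, 1.0],
    [2.8, 0.4, 0.3, 3.0, 0.5, 2.1],
]

if __name__ == "__main__":
    for row in distance_matrix(samples):
        print(row)
```

In this toy input the first two samples differ slightly in their real values but collapse to the same binary pattern, which is one way to picture the noise resilience mentioned above. A distance matrix of this kind could then be passed to a Multidimensional Scaling routine to visualize how the samples separate.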

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms*
  • Cluster Analysis*
  • Gene Expression Profiling / methods*
  • Glioma / classification
  • Glioma / genetics
  • Humans
  • Leiomyosarcoma / classification
  • Leiomyosarcoma / genetics
  • Models, Genetic*
  • Models, Statistical*
  • Oligonucleotide Array Sequence Analysis / methods*
  • Pattern Recognition, Automated
  • Reproducibility of Results
  • Sensitivity and Specificity
  • Statistics as Topic
  • Statistics, Nonparametric