GenRate: a generative model that finds and scores new genes and exons in genomic microarray data

Brendan J Frey; Quaid D Morris; Wen Zhang; Naveed Mohammad; Timothy R Hughes

Pac Symp Biocomput. 2005:495-506.

Authors

Brendan J Frey¹, Quaid D Morris, Wen Zhang, Naveed Mohammad, Timothy R Hughes

Affiliation

¹ Dept of Electrical and Computer Engineering, University of Toronto, Toronto, ON, M5S 3G4, Canada. frey@psi.toronto.edu

PMID: 15759654

Abstract

Recently, researchers have made some progress in using microarrays to validate predicted exons in genome sequence and find new gene structures. However, current methods rely on separately making threshold-based decisions on intensity of expression, similarity of expression profiles, and arrangements of exons in the genome. We have taken a Bayesian approach and developed GenRate, a generative model that accounts for both genome-wide expression data taken from multiple conditions (e.g., tissues) and co-location and density of probes in DNA sequence data. GenRate balances probabilistic evidence derived from different sources and outputs scores (log-likelihoods) for each gene model, enabling the estimation of false-positive and false-negative rates. The number of local minima in the model grows exponentially with the length of the DNA sequence data, so direct application of the EM learning algorithm produces poor results. We describe a novel way of parameterizing the model using examples from the data set, so that good solutions are found using an efficient algorithm. We apply GenRate to a subset of mouse genome-wide expression data that we have created, and discuss the statistical significance of the genes found by GenRate. Three of the highest-ranking gene structures found by GenRate, each containing thousands of bases from the genome, are confirmed using RT-PCR experiments.
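
The idea of scoring candidate gene structures by log-likelihood and calibrating those scores against a null can be illustrated with a toy sketch. This is not the GenRate model itself: it assumes a simple Gaussian stand-in in which probes belonging to one gene share an expression profile, the shared profile is taken from one of the probes in the window (loosely echoing the use of examples from the data set as parameters), and unrelated probes follow a broad zero-mean background. All function names, the window-scanning scheme, and the values of `sigma_gene` and `sigma_bg` are hypothetical choices made for illustration.

```python
import math
import random


def gaussian_logpdf(x, mean, sigma):
    """Log density of a univariate Gaussian."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mean) ** 2 / (2 * sigma ** 2)


def window_score(profiles, sigma_gene=0.5, sigma_bg=2.0):
    """Toy log-likelihood ratio for 'these probes share one expression profile'.

    profiles: list of per-probe expression vectors (one value per tissue).
    Each probe in turn serves as the prototype profile (an example taken
    from the data); the best-scoring prototype defines the window score.
    """
    best = -math.inf
    for k, prototype in enumerate(profiles):
        total = 0.0
        for j, profile in enumerate(profiles):
            if j == k:
                continue  # the prototype is not scored against itself
            for x, m in zip(profile, prototype):
                # Evidence for "close to the prototype" versus "background".
                total += gaussian_logpdf(x, m, sigma_gene) - gaussian_logpdf(x, 0.0, sigma_bg)
        best = max(best, total)
    return best


def scan(probes, width):
    """Score every run of `width` consecutive probes along the sequence."""
    return [window_score(probes[i:i + width]) for i in range(len(probes) - width + 1)]


def permuted_null(probes, width, rng):
    """Null scores: shuffle each probe's values across tissues independently,
    which destroys any shared profile while keeping each probe's intensities."""
    shuffled = [rng.sample(profile, len(profile)) for profile in probes]
    return scan(shuffled, width)


def empirical_fpr(null_scores, threshold):
    """Fraction of null scores at or above the threshold."""
    return sum(s >= threshold for s in null_scores) / len(null_scores)
```

In use, one would call `scan` on the probe profiles ordered by genomic position, call `permuted_null` with a seeded `random.Random` to obtain null scores, and pass a chosen score threshold to `empirical_fpr` to estimate how often a score that high arises by chance. Because each window gets a single score on a common scale, the competing sources of evidence are traded off in one quantity rather than through separate thresholds.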

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Animals
Computational Biology / methods
Databases, Nucleic Acid
Genomics*
Mice
Models, Genetic*
Oligonucleotide Array Sequence Analysis*
Probability
Software