A Bayesian nonparametric approach for comparing clustering structures in EST libraries

J Comput Biol. 2008 Dec;15(10):1315-27. doi: 10.1089/cmb.2008.0043.

Abstract

Inference for Expressed Sequence Tag (EST) data is considered. We focus on evaluating the redundancy of a cDNA library and, more importantly, on comparing different libraries on the basis of their clustering structure. The numerical results we obtain allow us to assess the effect of an error correction procedure for EST data and to study the compatibility of single EST libraries with merged ones. The proposed method is based on a Bayesian nonparametric approach that allows one to understand the clustering mechanism generating the observed data. As a specific nonparametric model, we use the two-parameter Poisson-Dirichlet (PD) process. The PD process is a tractable nonparametric prior and a natural candidate for modeling data arising from discrete distributions. It allows prediction and testing in order to analyze the clustering structure featured by the data. We show how a full Bayesian analysis can be performed and describe the corresponding computational algorithm.
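
To make the clustering mechanism of the two-parameter PD process concrete, the following minimal sketch simulates its standard sequential predictive rule: given n observations grouped into k clusters with sizes n_j, the next observation starts a new cluster with probability (theta + k*sigma)/(theta + n) and joins cluster j with probability (n_j - sigma)/(theta + n), where 0 <= sigma < 1 and theta > -sigma. This is not the computational algorithm described in the paper. The function names `prob_new_cluster` and `sample_cluster_sizes` are hypothetical, and the parameter values are illustrative only, not estimates from any EST library.

```python
import random


def prob_new_cluster(n, k, sigma, theta):
    """Probability that observation n+1 opens a new cluster,
    given k clusters among the first n observations (n >= 1)."""
    return (theta + k * sigma) / (theta + n)


def sample_cluster_sizes(n, sigma, theta, seed=0):
    """Draw cluster sizes for n observations from the PD(sigma, theta)
    predictive rule. Illustrative sketch, not the paper's algorithm."""
    if not (0 <= sigma < 1) or theta <= -sigma:
        raise ValueError("require 0 <= sigma < 1 and theta > -sigma")
    rng = random.Random(seed)
    sizes = []  # sizes[j] = number of observations in cluster j
    for i in range(n):
        # The first observation always opens a cluster; afterwards,
        # i observations are already allocated to len(sizes) clusters.
        if not sizes or rng.random() < prob_new_cluster(i, len(sizes), sigma, theta):
            sizes.append(1)
        else:
            # Join an existing cluster with weight proportional to (n_j - sigma).
            weights = [s - sigma for s in sizes]
            j = rng.choices(range(len(sizes)), weights=weights)[0]
            sizes[j] += 1
    return sizes


if __name__ == "__main__":
    sizes = sample_cluster_sizes(1000, sigma=0.5, theta=1.0, seed=1)
    print("clusters:", len(sizes), "largest:", max(sizes))
    print("P(new cluster next):", prob_new_cluster(1000, len(sizes), 0.5, 1.0))
```

If each cluster is read as a distinct gene and each observation as a sequenced EST, a low probability of opening a new cluster corresponds to a highly redundant library. Setting sigma to 0 recovers the one-parameter Dirichlet process case.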

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Base Sequence
  • Bayes Theorem*
  • Cluster Analysis
  • Expressed Sequence Tags*
  • Gene Library*
  • Molecular Sequence Data
  • Sequence Analysis, DNA / methods