Determination of local statistical significance of patterns in Markov sequences with application to promoter element identification

J Comput Biol. 2004;11(1):1-14. doi: 10.1089/106652704773416858.

Abstract

High-level eukaryotic genomes present a particular challenge to the computational identification of transcription factor binding sites (TFBSs) because of their long noncoding regions and large numbers of repeat elements, as evidenced by the noisy results generated by most current methods. In this paper, we present a p-value-based scoring scheme that uses probability generating functions to evaluate the statistical significance of potential TFBSs. Furthermore, we introduce the local genomic context into the model, so that candidate sites are evaluated both on their similarity to known binding sites and on their contrast with their respective local genomic contexts. We demonstrate that our approach, LMM, is advantageous in the prediction of myogenin and MEF2 binding sites in the human genome. We also applied LMM to large-scale human binding site sequences in situ and found that, compared with current popular methods, LMM analysis can reduce false positive errors by more than 50% without compromising sensitivity. This improvement will be important to any subsequent algorithm that aims to detect regulatory modules based on known position-specific scoring matrices (PSSMs).
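To make the idea of a p-value for a candidate site concrete, the following is a minimal sketch, not the authors' implementation. It assumes an integer-valued scoring matrix, a first-order Markov background estimated from a flanking window with pseudocounts, and a toy matrix and sequence invented for illustration. The exact score distribution is built column by column, which amounts to collecting the coefficients of the score's probability generating function. The names `estimate_markov`, `score_distribution`, `p_value`, `toy_pssm`, and `local_context` are hypothetical.

```python
from collections import defaultdict

BASES = "ACGT"

def estimate_markov(seq, pseudo=1.0):
    """Estimate first-order Markov parameters from a local window.

    Assumption: the local context is summarized by base frequencies and
    base-to-base transition counts, smoothed with a pseudocount.
    """
    freq = {b: pseudo for b in BASES}
    counts = {a: {b: pseudo for b in BASES} for a in BASES}
    for ch in seq:
        if ch in freq:
            freq[ch] += 1
    for a, b in zip(seq, seq[1:]):
        if a in counts and b in counts:
            counts[a][b] += 1
    total = sum(freq.values())
    init = {b: freq[b] / total for b in BASES}
    trans = {a: {b: counts[a][b] / sum(counts[a].values()) for b in BASES}
             for a in BASES}
    return init, trans

def score_distribution(pssm, init, trans):
    """Exact distribution of the matrix score of a random window.

    pssm is a list of dicts (one per column) mapping base -> integer score.
    The state tracks (last base, accumulated score) so that the Markov
    dependence between neighboring positions is respected.
    """
    state = {b: defaultdict(float) for b in BASES}
    for b in BASES:
        state[b][pssm[0][b]] += init[b]
    for col in pssm[1:]:
        nxt = {b: defaultdict(float) for b in BASES}
        for prev in BASES:
            for s, p in state[prev].items():
                for b in BASES:
                    nxt[b][s + col[b]] += p * trans[prev][b]
        state = nxt
    dist = defaultdict(float)
    for b in BASES:
        for s, p in state[b].items():
            dist[s] += p
    return dict(dist)

def p_value(dist, observed_score):
    """Probability that a background window scores at least observed_score."""
    return sum(p for s, p in dist.items() if s >= observed_score)

# Toy, made-up inputs for illustration only.
toy_pssm = [
    {"A": 2, "C": 0, "G": 0, "T": -1},
    {"A": 0, "C": 2, "G": 0, "T": -1},
    {"A": 0, "C": 0, "G": 2, "T": -1},
]
local_context = "ATATTTAAATATACGATATTTTAAATTAT"

init, trans = estimate_markov(local_context)
dist = score_distribution(toy_pssm, init, trans)
print("p-value of score 6 in this local context:", p_value(dist, 6))
```

Because the background is re-estimated from each candidate's surroundings, the same matrix score can receive different p-values in different regions, which is one plausible way to read "contrast with the local genomic context".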

Publication types

  • Research Support, Non-U.S. Gov't
  • Research Support, U.S. Gov't, Non-P.H.S.
  • Research Support, U.S. Gov't, P.H.S.

MeSH terms

  • Algorithms
  • Animals
  • Computational Biology / methods
  • DNA-Binding Proteins / genetics*
  • Genome
  • Humans
  • MEF2 Transcription Factors
  • Markov Chains*
  • Myogenic Regulatory Factors
  • Myogenin / genetics*
  • Promoter Regions, Genetic*
  • Protein Binding / genetics
  • Regulatory Sequences, Nucleic Acid*
  • Transcription Factors / genetics*

Substances

  • DNA-Binding Proteins
  • MEF2 Transcription Factors
  • MYOG protein, human
  • Myogenic Regulatory Factors
  • Myogenin
  • Transcription Factors