WeederH: an algorithm for finding conserved regulatory motifs and regions in homologous sequences

BMC Bioinformatics. 2007 Feb 7;8:46. doi: 10.1186/1471-2105-8-46.

Abstract

Background: This work addresses the problem of detecting conserved transcription factor binding sites and, more generally, regulatory regions by analyzing sequences from homologous genes, an approach that is becoming more widely used as the amount of available genomic data grows.

Results: We present an algorithm that identifies conserved transcription factor binding sites in a given sequence by comparing it to one or more homologs, adapting a framework we previously introduced for the discovery of sites in sequences from co-regulated genes. Unlike the most commonly used methods, our approach neither requires nor computes an alignment of the sequences investigated, nor does it rely on descriptors of the binding specificity of known transcription factors. The main novel idea is a relative measure of conservation, based on the assumption that true functional elements should be more conserved than the sequence surrounding them. We present tests in which we applied the algorithm to identify conserved annotated sites in homologous promoters, as well as in distal regions such as enhancers.
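
The idea of a relative, alignment-free measure of conservation can be illustrated with a toy sketch. This is not the WeederH implementation: the abstract does not give the scoring details, so the Hamming-distance matching, the neighbourhood average used as background, and the names `conservation_profile`, `relative_conservation`, and `flank` are all assumptions made for illustration. Each window of the reference sequence is scored by its best approximate occurrence anywhere in each homolog, and that score is then compared with the scores of the surrounding windows.

```python
from statistics import mean


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def best_match_distance(kmer, homolog):
    """Smallest Hamming distance between kmer and any window of homolog.

    No alignment is computed: every window of the homolog is a candidate.
    Assumes len(homolog) >= len(kmer).
    """
    k = len(kmer)
    return min(hamming(kmer, homolog[j:j + k])
               for j in range(len(homolog) - k + 1))


def conservation_profile(reference, homologs, k):
    """For each k-mer of the reference, the fraction of positions matched by
    its best occurrence in a homolog, averaged over all homologs."""
    profile = []
    for i in range(len(reference) - k + 1):
        kmer = reference[i:i + k]
        scores = [1 - best_match_distance(kmer, h) / k for h in homologs]
        profile.append(mean(scores))
    return profile


def relative_conservation(profile, flank):
    """Hypothetical relative measure: each window's score minus the mean
    score of the windows within `flank` positions on either side."""
    rel = []
    for i, score in enumerate(profile):
        lo, hi = max(0, i - flank), min(len(profile), i + flank + 1)
        neighbours = profile[lo:i] + profile[i + 1:hi]
        background = mean(neighbours) if neighbours else score
        rel.append(score - background)
    return rel


# Toy data: the same 8-mer sits at different offsets in two unrelated contexts.
motif = "TGACTCAG"
reference = "A" * 10 + motif + "C" * 10
homolog = "G" * 5 + motif + "T" * 10

profile = conservation_profile(reference, [homolog], k=len(motif))
rel = relative_conservation(profile, flank=3)
best = max(range(len(rel)), key=rel.__getitem__)
print(best, reference[best:best + len(motif)])
```

In this toy case the window with the highest relative score is the shared motif, even though it lies at a different offset in each sequence. Real promoters would require a statistically grounded background model and an efficient search, neither of which this sketch attempts.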

Conclusion: The test results show that the algorithm can provide fast and reliable predictions of conserved transcription factor binding sites regulating the transcription of a gene, with better performance than other available methods for the same task. We also show examples of how the algorithm can be successfully employed when promoter annotations of the genes investigated are missing, or when regulatory sites and regions are located far from the genes.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms*
  • Base Sequence
  • Binding Sites
  • Chromosome Mapping / methods*
  • Conserved Sequence / genetics*
  • Molecular Sequence Data
  • Protein Binding
  • Regulatory Sequences, Nucleic Acid / genetics*
  • Sequence Analysis, DNA / methods*
  • Sequence Homology, Nucleic Acid*
  • Software
  • Transcription Factors / genetics*

Substances

  • Transcription Factors