Alignments anchored on genomic landmarks can aid in the identification of regulatory elements

Bioinformatics. 2005 Jun;21 Suppl 1(Suppl 1):i440-8. doi: 10.1093/bioinformatics/bti1028.

Abstract

Motivation: The transcription start site (TSS) has been located for an increasing number of genes across several organisms. Statistical tests have shown that some cis-acting regulatory elements have positional preferences with respect to the TSS, but few strategies have emerged for locating elements by their positional preferences. This paper elaborates such a strategy. First, we align promoter regions without gaps, anchoring the alignment on each promoter's TSS. Second, we apply a novel word-specific mask. Third, we apply a clustering test related to gapless BLAST statistics. The test examines whether any specific word is placed unusually consistently with respect to the TSS. Finally, our program A-GLAM, an extension of the GLAM program, uses significant word positions as new 'anchors' to realign the sequences. A Gibbs sampling algorithm then locates putative cis-acting regulatory elements. Usually, Gibbs sampling requires a preliminary masking step, to avoid convergence onto a dominant but uninteresting signal from a DNA repeat. However, since the positional anchors focus A-GLAM on the motif of interest, masking DNA repeats during Gibbs sampling becomes unnecessary.
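The first and third steps above (a gapless alignment anchored on each TSS, then asking whether a word sits at unusually consistent offsets) can be illustrated with a minimal sketch. It is not the A-GLAM implementation. The function names `word_positions` and `best_window`, the input layout of `(sequence, tss_index)` pairs, and the `width` parameter are all hypothetical. The sketch also replaces the paper's clustering test, which is related to gapless BLAST statistics, with a plain sliding-window count, so it produces a descriptive tally rather than a P-value. The word-specific mask and the Gibbs sampling step are omitted.

```python
from collections import Counter, defaultdict


def word_positions(promoters, k=8):
    """Tally, for every k-letter word, its start offsets relative to the TSS.

    `promoters` is an iterable of (sequence, tss_index) pairs (an assumed
    input layout). Expressing every position as an offset from the TSS is
    equivalent to a gapless alignment anchored on each promoter's TSS.
    """
    positions = defaultdict(Counter)
    for seq, tss_index in promoters:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            word = seq[i:i + k]
            # Skip words containing ambiguous bases such as N.
            if set(word) <= set("ACGT"):
                positions[word][i - tss_index] += 1
    return positions


def best_window(offset_counts, width=3):
    """Return (start_offset, count) for the densest window of `width` offsets.

    A simple stand-in for a clustering statistic, not the paper's test.
    Only windows that start at an occupied offset are checked, which is
    enough because any densest window can be shifted to start at one.
    """
    best = (None, 0)
    for start in sorted(offset_counts):
        total = sum(offset_counts.get(start + d, 0) for d in range(width))
        if total > best[1]:
            best = (start, total)
    return best
```

Calling `word_positions(promoters, k=8)` and then `best_window(positions[word])` for each word gives, per word, the TSS-relative window holding the most occurrences. Judging whether that count is significant would still require a null model and a multiple-test correction, which this sketch does not provide.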

Results: In a set of human DNA sequences with experimentally characterized TSSs, the placement of 791 octonucleotide words was unusually consistent (multiple test corrected P < 0.05). Alignments anchored on these words sometimes located statistically significant motifs inaccessible to GLAM or AlignACE.

Availability: The A-GLAM program and a list of statistically significant words are available at ftp://ftp.ncbi.nih.gov/pub/spouge/papers/archive/AGLAM/.

MeSH terms

  • Amino Acid Motifs
  • Base Sequence
  • Cluster Analysis
  • Computational Biology / methods*
  • DNA / chemistry
  • Databases, Protein
  • Genomics / methods*
  • Humans
  • Models, Statistical
  • Molecular Sequence Data
  • Nucleotides / chemistry
  • Promoter Regions, Genetic
  • Regulatory Sequences, Nucleic Acid
  • Sequence Alignment
  • Software
  • Transcription Initiation Site

Substances

  • Nucleotides
  • DNA