Recurrence time statistics: versatile tools for genomic DNA sequence analysis

J Bioinform Comput Biol. 2005 Jun;3(3):677-96. doi: 10.1142/s0219720005001235.

Abstract

With the completion of the human and a few model organisms' genomes, and with the genomes of many other organisms waiting to be sequenced, it has become increasingly important to develop faster computational tools which are capable of easily identifying the structures and extracting features from DNA sequences. One of the more important structures in a DNA sequence is repeat-related. Often they have to be masked before protein coding regions along a DNA sequence are to be identified or redundant expressed sequence tags (ESTs) are to be sequenced. Here we report a novel recurrence time-based method for sequence analysis. The method can conveniently study all kinds of periodicity and exhaustively find all repeat-related features from a genomic DNA sequence. An efficient codon index is also derived from the recurrence time statistics, which has the salient features of being largely species-independent and working well on very short sequences. Efficient codon indices are key elements of successful gene finding algorithms, and are particularly useful for determining whether a suspected EST belongs to a coding or non-coding region. We illustrate the power of the method by studying the genomes of E. coli, the yeast S. cervisivae, the nematode worm C. elegans, and the human, Homo sapiens. Our method requires approximately 6 . N byte memory and a computational time of N log N to extract all the repeat-related and periodic or quasi-periodic features from a sequence of length N without any prior knowledge on the consensus sequence of those features, hence enables us to carry out sequence analysis on the whole genomic scale by a PC.

Publication types

  • Evaluation Study

MeSH terms

  • Algorithms*
  • Chromosome Mapping / methods*
  • Computer Simulation
  • Data Interpretation, Statistical
  • Models, Genetic*
  • Models, Statistical
  • Repetitive Sequences, Nucleic Acid / genetics*
  • Sequence Analysis, DNA / methods*
  • Software*
  • Time Factors