A relative-entropy algorithm for genomic fingerprinting captures host-phage similarities

J Bacteriol. 2005 Dec;187(24):8370-4. doi: 10.1128/JB.187.24.8370-8374.2005.

Abstract

The degeneracy of codons allows a multitude of possible sequences to code for the same protein. Hidden within the particular choice of sequence for each organism are over 100 previously undiscovered, biologically significant short oligonucleotides (length, 2 to 7 nucleotides). We present an information-theoretic algorithm that finds these novel signals. Applying this algorithm to the 209 sequenced bacterial genomes in the NCBI database, we determine for each bacterium a set of oligonucleotides that uniquely characterizes the organism. Some of these signals have known biological functions, such as restriction enzyme binding sites, but most are new. We also introduce an accompanying scoring algorithm that assigns sequences of 100 kb to their correct species, chosen from among hundreds, with 92% accuracy. This algorithm also does far better than previous methods at relating phage genomes to their bacterial hosts, suggesting that the lists of oligonucleotides are "genomic fingerprints" that encode information about the effects of the cellular environment on DNA sequence. Our approach provides a novel basis for phylogeny and may be well suited for classifying the short DNA fragments obtained by environmental shotgun sequencing. The methods developed here can be readily extended to other problems in bioinformatics.
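The abstract does not give the formulas behind the algorithm, so the following is only a generic sketch of how a relative-entropy (Kullback-Leibler) comparison of oligonucleotide frequencies could work, not the authors' method. It ranks oligonucleotides by their contribution to the relative entropy between observed frequencies and those expected from base composition alone, and assigns a fragment to the reference whose oligonucleotide distribution is closest. All function names, the independent-base background model, the pseudocount, and the toy sequences are assumptions made for illustration.

```python
from collections import Counter
from itertools import product
from math import log2

BASES = "ACGT"

def kmer_counts(seq, k):
    """Count overlapping k-mers, skipping any window with a non-ACGT symbol."""
    seq = seq.upper()
    windows = (seq[i:i + k] for i in range(len(seq) - k + 1))
    return Counter(w for w in windows if set(w) <= set(BASES))

def relative_entropy_terms(seq, k):
    """Per-k-mer term p * log2(p / q), where q assumes independent bases.

    Assumption: the background model is the product of single-base
    frequencies; the paper may use a different reference distribution.
    """
    base = Counter(c for c in seq.upper() if c in BASES)
    n_bases = sum(base.values())
    counts = kmer_counts(seq, k)
    total = sum(counts.values())
    terms = {}
    for kmer, count in counts.items():
        p = count / total
        q = 1.0
        for b in kmer:
            q *= base[b] / n_bases
        terms[kmer] = p * log2(p / q)
    return terms

def fingerprint(seq, k, top=5):
    """Hypothetical 'fingerprint': k-mers with the largest absolute terms."""
    terms = relative_entropy_terms(seq, k)
    return sorted(terms, key=lambda m: abs(terms[m]), reverse=True)[:top]

def kmer_freqs(seq, k, pseudo=1.0):
    """Smoothed k-mer frequencies over all 4**k possible k-mers."""
    counts = kmer_counts(seq, k)
    all_kmers = ["".join(p) for p in product(BASES, repeat=k)]
    total = sum(counts.values()) + pseudo * len(all_kmers)
    return {m: (counts[m] + pseudo) / total for m in all_kmers}

def kl_divergence(p, q):
    """Relative entropy D(p || q) in bits; both must share the same keys."""
    return sum(p[m] * log2(p[m] / q[m]) for m in p)

def closest_reference(fragment, references, k=3):
    """Return the name of the reference with the smallest D(fragment || ref)."""
    frag = kmer_freqs(fragment, k)
    return min(
        references,
        key=lambda name: kl_divergence(frag, kmer_freqs(references[name], k)),
    )

# Toy usage with made-up sequences (not real genomes).
references = {
    "at_rich": "ATATTAAATTTAATAT" * 20,
    "gc_rich": "GCGCCGGGCCCGGCGC" * 20,
}
best = closest_reference("ATTTAATATAAT" * 5, references)
```

A real analysis would pool oligonucleotide lengths 2 to 7, as in the abstract, and would need a statistically justified threshold for calling a signal significant; neither is attempted here.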

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Bacteria / classification
  • Bacteria / genetics*
  • Bacteriophages / genetics*
  • Computational Biology*
  • DNA Fingerprinting
  • DNA, Bacterial / genetics*
  • DNA, Viral / genetics
  • Databases, Genetic
  • Genome, Bacterial*
  • Genome, Viral
  • Genomics*
  • Oligodeoxyribonucleotides / genetics*
  • Phylogeny
  • Sequence Homology, Nucleic Acid

Substances

  • DNA, Bacterial
  • DNA, Viral
  • Oligodeoxyribonucleotides