Primary sequences of proteins from complete genomes display a singular periodicity: Alignment-free N-gram analysis

C R Biol. 2007 Jan;330(1):33-48. doi: 10.1016/j.crvi.2006.11.001. Epub 2006 Dec 1.

Abstract

A method is proposed to represent and analyze complete genome sequences (52 species from prokaryotes and eukaryotes), based on n-gram frequencies of amino acid pairs (bigrams) separated by a given number of other residues. For each species analyzed, it allows us to construct profiles of over-abundant and over-deficient occurrences, summarizing amino acid bigram frequencies over the entire genome. The method deals efficiently with the sparseness of statistical representations of individual sequences, and describes every gene sequence in the same way, independently of its length and of the genome size. The frequency of over-abundant and over-deficient bigram occurrences shows a singular periodicity of around 3.5 peptide bonds, suggesting a relation with the alpha-helical secondary structure.
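
The counting step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the toy sequences, and the observed-over-expected ratio (which uses the product of single-residue frequencies as the baseline) are assumptions made for illustration; the abstract does not state how over-abundance and over-deficiency are scored.

```python
from collections import Counter


def gapped_bigram_counts(seq, gap):
    """Count residue pairs (seq[i], seq[i + gap + 1]), i.e. pairs
    separated by `gap` other residues. Hypothetical helper."""
    step = gap + 1
    return Counter(zip(seq, seq[step:]))


def observed_over_expected(sequences, gap):
    """Pool gapped bigram counts over all sequences and divide each count
    by the count expected if the two residues were independent.
    The independence baseline is an assumption, not the paper's statistic."""
    pair_counts = Counter()
    residue_counts = Counter()
    for seq in sequences:
        pair_counts.update(gapped_bigram_counts(seq, gap))
        residue_counts.update(seq)

    total_pairs = sum(pair_counts.values())
    total_residues = sum(residue_counts.values())

    ratios = {}
    for (a, b), observed in pair_counts.items():
        freq_a = residue_counts[a] / total_residues
        freq_b = residue_counts[b] / total_residues
        expected = total_pairs * freq_a * freq_b
        ratios[(a, b)] = observed / expected
    return ratios


if __name__ == "__main__":
    # Toy sequences, invented for the example.
    proteome = ["MKVLAAGIVALL", "MKKLLEELAKKA"]
    for gap in range(0, 8):
        ratios = observed_over_expected(proteome, gap)
        enriched = sum(1 for r in ratios.values() if r > 1.0)
        print(gap, enriched)
```

Pooling counts over every sequence before computing ratios is one way to sidestep the sparseness of individual sequences that the abstract mentions. Scanning the gap from 0 upward is how a periodicity in the profile would become visible on real proteomes.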

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Amino Acid Sequence
  • Archaea / genetics
  • Archaeal Proteins / chemistry
  • Archaeal Proteins / genetics
  • Bacteria / genetics
  • Bacterial Proteins / chemistry
  • Bacterial Proteins / genetics
  • Genome*
  • Proteins / chemistry*
  • Proteins / genetics
  • Sequence Alignment
  • Sequence Homology, Amino Acid

Substances

  • Archaeal Proteins
  • Bacterial Proteins
  • Proteins