Natural vs. random protein sequences: Discovering combinatorics properties on amino acid words

J Theor Biol. 2016 Feb 21:391:13-20. doi: 10.1016/j.jtbi.2015.11.022. Epub 2015 Dec 2.

Abstract

Random mutations and natural selection have driven the evolution of the protein amino acid sequences observed in nature today. Which of the two is the dominant force in protein evolution still lacks an unambiguous answer. Random mutations tend to randomize protein sequences, whereas selection is expected to impose rigid constraints on amino acid sequences so that proteins function correctly. Moreover, the space of all possible amino acid sequences is so large that a well-tuned amino acid sequence could plausibly be indistinguishable from a random one. To study whether random and natural amino acid sequences can be discriminated, we introduce different measures of association between pairs of amino acids in a sequence and apply them to a dataset of 1047 natural protein sequences and 10,470 random sequences, carefully generated to preserve the relative length and amino acid distribution of the natural proteins. We analyze the multidimensional measures with machine learning techniques and show that, to a reasonable extent, natural protein sequences can be differentiated from random ones.
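
The abstract does not spell out how the random sequences were generated or which association measures were used, so the following Python fragment is only an illustrative sketch under stated assumptions. It assumes that a random counterpart can be obtained by shuffling a natural sequence (which preserves its length and amino acid composition exactly), that ten such decoys are drawn per protein (consistent with 10,470 = 10 x 1047), and that a simple observed/expected ratio for residue pairs at a fixed gap can stand in for a pairwise association measure. The function names `shuffled_decoys`, `pair_counts`, and `pair_enrichment` are hypothetical and do not come from the paper.

```python
import random
from collections import Counter


def shuffled_decoys(seq, n_decoys=10, seed=0):
    """Return n_decoys shuffled copies of seq.

    Assumption: shuffling is used as the randomization scheme because it
    keeps the length and amino acid composition of the input unchanged.
    """
    rng = random.Random(seed)
    decoys = []
    for _ in range(n_decoys):
        residues = list(seq)
        rng.shuffle(residues)
        decoys.append("".join(residues))
    return decoys


def pair_counts(seq, gap=1):
    """Count ordered residue pairs (a, b) where b lies `gap` positions after a."""
    return Counter((seq[i], seq[i + gap]) for i in range(len(seq) - gap))


def pair_enrichment(seq, gap=1):
    """Observed / expected frequency for every pair that occurs in seq.

    Assumption: the expected frequency treats positions as independent
    and uses the sequence's own amino acid composition.
    """
    n_pairs = len(seq) - gap
    if n_pairs <= 0:
        return {}
    freq = {a: c / len(seq) for a, c in Counter(seq).items()}
    return {
        (a, b): (c / n_pairs) / (freq[a] * freq[b])
        for (a, b), c in pair_counts(seq, gap).items()
    }


if __name__ == "__main__":
    natural = "MKTAYIAKQRQISFVKSHFSRQ"  # made-up example sequence
    decoys = shuffled_decoys(natural, n_decoys=10, seed=1)
    print(len(decoys), "decoys with the same composition as the input")
    print(pair_enrichment(natural, gap=1))
```

Per-pair values of this kind, computed for several gaps, would give each sequence a multidimensional feature vector that a standard classifier could then use to try to separate natural sequences from shuffled ones.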

Keywords: Amino acid association; Combinatorics of words; Protein sequence; Random sequence.

MeSH terms

  • Amino Acid Sequence
  • Evolution, Molecular*
  • Models, Genetic*
  • Proteins / chemistry*
  • Proteins / genetics*

Substances

  • Proteins