A mathematical consideration of the word-composition vector method in comparison of biological sequences

Biosystems. 2011 Nov;106(2-3):67-75. doi: 10.1016/j.biosystems.2011.06.009. Epub 2011 Jul 1.

Abstract

To measure the similarity or dissimilarity between two biological sequences, several papers have proposed metrics based on the "word-composition vector". The essence of these metrics is as follows. First, the occurrence frequencies of all K-tuple words are counted throughout each of the two sequences. The two sequences are thereby transformed into their respective word-composition vectors. Next, a distance metric, for example the angle between the two vectors, is calculated. A significant issue is determining the optimal word size K. Using a mathematical model of the mutational events (including substitutions, insertions, deletions and duplications) that occur in sequences, we analyzed how the angle between the composition vectors depends on those events. We also considered the optimal word size (= resolution) on the basis of our original approach. Our results were verified by computational experiments using artificially generated sequences, amino acid sequences of hemoglobin and nucleotide sequences of 16S ribosomal RNA.
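
The three steps described in the abstract (count K-tuple words, form the composition vectors, take the angle between them) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `word_composition` and `angle_between` are hypothetical, overlapping words are assumed, and raw counts are used in place of frequencies, which leaves the angle unchanged because it does not depend on the length of either vector. Any background correction applied in specific composition-vector methods is omitted.

```python
from collections import Counter
from math import acos, sqrt


def word_composition(seq, k):
    """Count every overlapping k-tuple word in seq (hypothetical helper)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def angle_between(seq_a, seq_b, k):
    """Angle in radians between the k-tuple composition vectors of two sequences."""
    va = word_composition(seq_a, k)
    vb = word_composition(seq_b, k)
    # Words absent from one sequence contribute zero to the dot product.
    dot = sum(count * vb[word] for word, count in va.items())
    norm_a = sqrt(sum(c * c for c in va.values()))
    norm_b = sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("each sequence must contain at least k symbols")
    # Clamp to [-1, 1] to guard against floating-point rounding before acos.
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return acos(cosine)
```

Under these assumptions, identical sequences give an angle of approximately zero, and sequences that share no K-tuple word give a right angle. Calling `angle_between` on the same pair of sequences for several values of `k` shows how the choice of word size changes the measured distance, which is the question the abstract raises.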

MeSH terms

  • Algorithms
  • Amino Acid Sequence
  • Computational Biology / methods*
  • Hemoglobins / genetics
  • Models, Genetic*
  • Mutation / genetics
  • RNA, Ribosomal, 16S / genetics
  • Sequence Analysis, DNA / methods*
  • Sequence Homology*

Substances

  • Hemoglobins
  • RNA, Ribosomal, 16S