Whole genome/proteome based phylogeny reconstruction for prokaryotes using higher order Markov model and chaos game representation

Mol Phylogenet Evol. 2016 Mar:96:102-111. doi: 10.1016/j.ympev.2015.12.011. Epub 2015 Dec 24.

Abstract

Traditional methods for sequence comparison and phylogeny reconstruction rely on pairwise and multiple sequence alignments. However, alignment cannot be directly applied to whole genome/proteome comparison and phylogenomic studies because of its high computational complexity. Hence, alignment-free methods have become popular in recent years. Here we propose a fast alignment-free method for whole genome/proteome comparison and phylogeny reconstruction using higher order Markov models and chaos game representation. In the present method, we use the transition matrices of higher order Markov models to characterize amino acid or DNA sequences for comparison. The order of the Markov model is uniquely identified by maximizing the average Shannon entropy of the conditional probability distributions. By using a one-dimensional chaos game representation and a linked list, the method reduces the large memory and time consumption caused by the large-scale conditional probability distributions. To illustrate the effectiveness of our method, we employ it for fast phylogeny reconstruction based on the genome/proteome sequences of two species data sets used in previously published papers. Our results demonstrate that the present method is useful and efficient.
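The ingredients named above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the unweighted averaging of entropies over observed contexts, the base-|alphabet| fraction used as the one-dimensional chaos game coordinate, and the dictionary used in place of a linked list to store only observed contexts are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from math import log2


def transition_counts(seq, k):
    """Count how often each symbol follows each length-k context in seq."""
    counts = defaultdict(Counter)
    for i in range(len(seq) - k):
        counts[seq[i:i + k]][seq[i + k]] += 1
    return counts


def conditional_probs(counts):
    """Normalize the counts of each context into a conditional distribution."""
    probs = {}
    for context, following in counts.items():
        total = sum(following.values())
        probs[context] = {s: c / total for s, c in following.items()}
    return probs


def average_entropy(probs):
    """Unweighted mean Shannon entropy (in bits) over observed contexts.

    Assumption: the abstract does not say how the average is weighted.
    """
    if not probs:
        return 0.0
    entropies = [-sum(p * log2(p) for p in dist.values())
                 for dist in probs.values()]
    return sum(entropies) / len(entropies)


def pick_order(seq, max_k):
    """Return the order k in 1..max_k with the largest average entropy."""
    return max(range(1, max_k + 1),
               key=lambda k: average_entropy(
                   conditional_probs(transition_counts(seq, k))))


def cgr_1d(kmer, alphabet="ACGT"):
    """Map a k-mer to a point in [0, 1) by reading it as a base-|alphabet|
    fraction (one possible one-dimensional chaos game coordinate)."""
    base = len(alphabet)
    return sum(alphabet.index(s) / base ** j
               for j, s in enumerate(kmer, start=1))
```

With this encoding, each length-k context maps to a single number that can serve as a compact key, and only contexts that actually occur in the sequence take up memory. For protein sequences the same functions would be called with a 20-letter alphabet string.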

Availability and implementation: The source code for our algorithm to compute the distance matrix, together with the genome/proteome sequences, can be downloaded from ftp://121.199.20.25/. The Phylip and EvolView software that we used to construct phylogenetic trees is available from their respective websites.

Keywords: Alignment-free whole proteome comparison; Chaos game representation; Higher order Markov model; Phylogenetic tree; Shannon entropy.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Genome / genetics*
  • Markov Chains*
  • Nonlinear Dynamics*
  • Phylogeny*
  • Prokaryotic Cells / metabolism*
  • Proteome / genetics*
  • Sequence Alignment
  • Software

Substances

  • Proteome