Estimation of relative effectiveness of phylogenetic programs by machine learning

J Bioinform Comput Biol. 2014 Apr;12(2):1441004. doi: 10.1142/S0219720014410042. Epub 2014 Mar 6.

Abstract

Reconstructing the phylogeny of a protein family from a sequence alignment can produce results of varying quality. Our goal is to predict the quality of phylogeny reconstruction from features that can be extracted from the input alignment. We used the Fitch-Margoliash (FM) method of phylogeny reconstruction and a random forest as the predictor. To train and test the predictor, we used alignments of orthologous series (OS), for which the result of phylogeny reconstruction can be evaluated by comparison with the trees of the corresponding organisms. Our results show that the quality of phylogeny reconstruction can be predicted with more than 80% precision. We also tried to predict which phylogeny reconstruction method, FM or UPGMA, is better for a particular alignment. With the feature set used, among the alignments for which the predictor predicts better performance of UPGMA, 56% do in fact give a better result with UPGMA. Given that UPGMA performs better for only 34% of the alignments in our testing set, this result shows that it is possible in principle to predict the better phylogeny reconstruction method from features of a sequence alignment.
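
The comparison of the 56% figure with the 34% base rate can be illustrated with a short calculation. The sketch below is not taken from the paper: the function names `precision` and `lift` are hypothetical, and only the two percentages come from the abstract. It assumes that "56%" is the precision of the "UPGMA is better" prediction and that "34%" is the share of test alignments for which UPGMA is actually better.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Fraction of positive predictions that are correct: TP / (TP + FP)."""
    predicted_positive = true_positives + false_positives
    if predicted_positive == 0:
        raise ValueError("no positive predictions, precision is undefined")
    return true_positives / predicted_positive


def lift(precision_value: float, base_rate: float) -> float:
    """Ratio of a predictor's precision to the precision of always predicting positive."""
    if not 0.0 < base_rate <= 1.0:
        raise ValueError("base_rate must be in (0, 1]")
    return precision_value / base_rate


# Values reported in the abstract (assumed interpretation, see the text above).
reported_precision = 0.56  # share of "UPGMA better" predictions that are correct
base_rate = 0.34           # share of test alignments where UPGMA is actually better

upgma_lift = lift(reported_precision, base_rate)
print(f"lift over base rate: {upgma_lift:.2f}")  # prints "lift over base rate: 1.65"
```

Under these assumptions, a lift above 1 means that the predictor selects alignments favoring UPGMA more often than labeling every alignment "UPGMA better" would, which is the sense in which the result indicates that prediction is possible in principle.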

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Archaea / genetics*
  • Artificial Intelligence*
  • Base Sequence
  • Conserved Sequence
  • Eukaryota / genetics*
  • Molecular Sequence Data
  • Phylogeny
  • Proteobacteria / genetics*
  • Proteome / genetics*
  • Sequence Analysis, DNA / methods*

Substances

  • Proteome