Re-evaluating Deep Neural Networks for Phylogeny Estimation: The Issue of Taxon Sampling

Paul Zaharias; Martin Grosshauser; Tandy Warnow

doi:10.1089/cmb.2021.0383

J Comput Biol. 2022 Jan;29(1):74-89. doi: 10.1089/cmb.2021.0383. Epub 2022 Jan 5.

Authors

Paul Zaharias¹, Martin Grosshauser², Tandy Warnow¹

Affiliations

¹ Department of Computer Science, University of Illinois, Urbana, Illinois, USA.
² Department of Physics, Technical University of Munich, Munich, Germany.

PMID: 34986031

Abstract

Deep neural networks (DNNs) have recently been proposed for quartet tree phylogeny estimation. Here, we present a study evaluating recently trained DNNs against a collection of standard phylogeny estimation methods on a heterogeneous collection of datasets simulated under the same models that were used to train the DNNs, and also under similar conditions but with higher rates of evolution. Our study shows that using DNNs with quartet amalgamation is less accurate than several of the standard phylogeny estimation methods we explore (e.g., maximum likelihood and maximum parsimony). We further find that simple standard phylogeny estimation methods match or improve on DNNs for quartet accuracy, especially, but not exclusively, when used in a global manner (i.e., the tree on the full dataset is computed and then the induced quartet trees are extracted from the full tree). Thus, our study provides evidence that a major challenge impacting the utility of current DNNs for phylogeny estimation is their restriction to estimating quartet trees that must subsequently be combined into a tree on the full dataset. In contrast, global methods (i.e., those that estimate trees from the full set of sequences) are able to benefit from taxon sampling, and hence have higher accuracy on large datasets.
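
The "global" use of a method described in the abstract (estimate one tree on all taxa, then read off the quartet tree it induces on each set of four taxa) can be illustrated with a small sketch. This is not the authors' code. It assumes, purely for illustration, that a tree is given as nested Python tuples with string leaf names; the function names `leaf_sets`, `induced_quartet`, and `all_induced_quartets` are hypothetical. The idea is that every edge of the tree splits the taxa into two sides, and a quartet {a, b, c, d} is resolved as ab|cd when some edge puts exactly two of the four taxa on one side.

```python
from itertools import combinations


def leaf_sets(tree):
    """Return (leaves, clades) for a tree given as nested tuples (assumed format).

    `leaves` is the set of all leaf names below `tree`; `clades` lists the
    leaf set below every non-root node, i.e. one side of every edge.
    """
    if not isinstance(tree, tuple):          # a leaf is any non-tuple label
        return frozenset([tree]), []
    leaves = frozenset()
    clades = []
    for child in tree:
        child_leaves, child_clades = leaf_sets(child)
        leaves |= child_leaves
        clades.extend(child_clades)
        clades.append(child_leaves)          # the edge above this child
    return leaves, clades


def induced_quartet(tree, quartet):
    """Return the quartet tree induced on four taxa, or None if unresolved.

    The result is a frozenset holding the two sibling pairs, so
    ab|cd is frozenset({frozenset({a, b}), frozenset({c, d})}).
    """
    all_leaves, clades = leaf_sets(tree)
    q = frozenset(quartet)
    if len(q) != 4 or not q <= all_leaves:
        raise ValueError("quartet must be four distinct taxa of the tree")
    for clade in clades:
        side = clade & q
        if len(side) == 2:                   # this edge separates 2 from 2
            return frozenset([side, q - side])
    return None                              # no edge resolves these four taxa


def all_induced_quartets(tree):
    """Map every 4-taxon subset of the tree to its induced quartet tree."""
    leaves, _ = leaf_sets(tree)
    return {
        frozenset(q): induced_quartet(tree, q)
        for q in combinations(sorted(leaves), 4)
    }


if __name__ == "__main__":
    # Hypothetical 5-taxon tree, standing in for a tree estimated on the full dataset.
    full_tree = ((("A", "B"), "C"), ("D", "E"))
    for taxa, split in all_induced_quartets(full_tree).items():
        pairs = sorted("".join(sorted(p)) for p in split)
        print("".join(sorted(taxa)), "->", "|".join(pairs))
```

Under these assumptions, quartet accuracy for a global method could be measured by comparing such induced quartet trees against those induced by the true tree, while a quartet-based method would instead estimate each four-taxon tree directly from the four sequences.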

Keywords: deep neural networks; phylogeny estimation; heterotachy.

Publication types

Research Support, Non-U.S. Gov't
Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

Amino Acid Sequence
Classification / methods
Computational Biology
Computer Simulation
Databases, Genetic / statistics & numerical data
Deep Learning*
Evolution, Molecular
Neural Networks, Computer*
Phylogeny*