How Well Does Your Phylogenetic Model Fit Your Data?

Syst Biol. 2019 Jan 1;68(1):157-167. doi: 10.1093/sysbio/syy066.

Abstract

Testing model-to-data fit is a fundamental principle of the statistical sciences. The purpose of such a test is to assess whether the selected best-fitting model adequately describes the behavior of the data. Despite their broad application across many areas of statistics, goodness-of-fit tests for phylogenetic models have received much less attention than model selection methods in the last decade. A number of approaches have been suggested, but these are often flawed, with problems ranging from systematic error in the models themselves to difficulties arising from the nature of phylogenetic data. Ultimately these problems lead to an inadequate choice of statistic, which is one of the main reasons why goodness-of-fit assessment is often neglected in phylogenetic analysis. We argue not only that goodness-of-fit measures are necessary to test how well the model reflects the data, but also that "useful" tests are needed that explain why the model-to-data fit may be inadequate. Such tests are a critical part of the model-building process: they allow the model to be adapted to fit the data better, or a model class to be rejected outright when the fit is so inadequate that the intended use of the class may be compromised. We discuss proposed and existing methods in both the maximum likelihood and Bayesian frameworks, highlighting their strengths and limitations for assessing goodness of fit. The final section discusses some critical open statistical problems in goodness-of-fit assessment for this field, with the hope of encouraging more research into such a fundamental yet underdeveloped area of phylogenetic inference. [Bayesian phylogenetics; goodness of fit; maximum likelihood; molecular phylogenetics; outlier detection; residual diagnostics.]

MeSH terms

  • Classification / methods*
  • Data Interpretation, Statistical
  • Models, Biological*
  • Phylogeny*