Evidence of absence treated as absence of evidence: The effects of variation in the number and distribution of gaps treated as missing data on the results of standard maximum likelihood analysis

Denis Jacob Machado; Santiago Castroviejo-Fisher; Taran Grant

doi:10.1016/j.ympev.2020.106966

Mol Phylogenet Evol. 2021 Jan:154:106966. doi: 10.1016/j.ympev.2020.106966. Epub 2020 Sep 22.

Authors

Denis Jacob Machado¹, Santiago Castroviejo-Fisher², Taran Grant³

Affiliations

¹ University of North Carolina at Charlotte, College of Computing and Informatics, Department of Bioinformatics and Genomics, 9201 University City Blvd, Charlotte, NC 28223, USA; Universidade de São Paulo, Programa Interunidades de Pós-Graduação em Bioinformática, Rua do Matão, 1010, CEP: 05508-090 São Paulo, SP, Brazil. Electronic address: dmachado@uncc.edu.
² Pontifícia Universidade Católica do Rio Grande do Sul, Laboratório de Sistemática de Vertebrados, Avenida Ipiranga, 6681, prédio 12, Partenon, CEP: 90619-900 Porto Alegre, RS, Brazil.
³ Universidade de São Paulo, Instituto de Biociências, Departamento de Zoologia, Laboratório de Anfíbios, Rua do Matão, tv. 14, 101, Cidade Universitária, CEP: 05508-090 São Paulo, SP, Brazil.

PMID: 32971285

Abstract

Although numerous studies have demonstrated the theoretical and empirical importance of treating gaps as insertion/deletion (indel) events in phylogenetic analyses, the standard approach to maximum likelihood (ML) analysis employed in the vast majority of empirical studies codes gaps as nucleotides of unknown identity ("missing data"). Therefore, it is imperative to understand the empirical consequences of different numbers and distributions of gaps treated as missing data. We evaluated the effects of variation in the number and distribution of gaps (i.e., no base, coded as IUPAC "." or "-") treated as missing data (i.e., any base, coded as "?" or IUPAC "N") in standard ML analysis. We obtained alignments with variable numbers and arrangements of gaps by aligning seven diverse empirical datasets under different gap opening costs using MAFFT. We selected the optimal substitution model for each alignment using the corrected Akaike Information Criterion in jModelTest2 and searched for optimal trees using GARLI. We also employed a Monte Carlo approach to randomly replace nucleotides with gaps (treated as missing data) in an empirical dataset to understand more precisely the effects of varying their number and distribution. To compare alignments, we developed four new indices and used several existing measures to quantify the number and distribution of gaps in all alignments. Our most important finding is that ML scores correlate negatively with gap opening costs and the amount of missing data. However, this negative relationship is not due to the increase in missing data per se (which increases ML scores) but instead to the effect of gaps on nucleotide homology. These variables also cause significant but largely unpredictable effects on tree topology.
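
The Monte Carlo replacement step described in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify the sampling scheme, so everything here is an assumption made for illustration: the function names `mask_random_sites` and `missing_fraction` are hypothetical, cells are drawn uniformly at random without replacement, and the alignment is represented as a plain dictionary of equal-length sequences rather than the file formats used in the study.

```python
import random


def mask_random_sites(alignment, fraction, seed=None, missing="?"):
    """Return a copy of the alignment with a random fraction of its
    nucleotide cells replaced by a missing-data symbol.

    Assumption: uniform sampling without replacement over all cells
    that currently hold A, C, G, or T. The paper may have used a
    different scheme.
    """
    rng = random.Random(seed)
    rows = {name: list(seq) for name, seq in alignment.items()}
    # Only cells that currently hold a nucleotide are eligible.
    cells = [
        (name, i)
        for name, seq in rows.items()
        for i, ch in enumerate(seq)
        if ch.upper() in "ACGT"
    ]
    n_mask = round(fraction * len(cells))
    for name, i in rng.sample(cells, n_mask):
        rows[name][i] = missing
    return {name: "".join(seq) for name, seq in rows.items()}


def missing_fraction(alignment, symbols="?N-."):
    """Proportion of alignment cells holding a gap or missing-data symbol."""
    total = sum(len(seq) for seq in alignment.values())
    hits = sum(seq.count(s) for seq in alignment.values() for s in symbols)
    return hits / total


# Toy alignment (made-up sequences, not data from the study).
toy = {"taxon_a": "ACGTACGT", "taxon_b": "ACGTACGT"}
masked = mask_random_sites(toy, fraction=0.25, seed=1)
```

Repeating the call with different seeds and fractions yields replicate datasets whose number and distribution of missing cells vary while the underlying nucleotide homology stays fixed, which is the contrast the abstract draws against realigning under different gap opening costs.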

Keywords: Alignment; Ambiguous characters; Indel; Maximum likelihood; Missing data; Phylogenetic analysis.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Databases, Genetic
Likelihood Functions
Monte Carlo Method
Nucleotides / genetics
Phylogeny*
Reference Standards
Sequence Alignment

Substances

Nucleotides