Machine learning approaches delimit cryptic taxa in a previously intractable species complex

Mol Phylogenet Evol. 2024 Jun:195:108061. doi: 10.1016/j.ympev.2024.108061. Epub 2024 Mar 12.

Abstract

Cryptic species are not diagnosable via morphological criteria, but can be detected through analysis of DNA sequences. A number of methods have been developed for identifying species based on genetic data; however, these methods are prone to over-splitting taxa with extreme population structure, such as dispersal-limited organisms. Machine learning methodologies have the potential to overcome this challenge. Here, we apply such approaches, using a large dataset generated through hybrid target enrichment of ultraconserved elements (UCEs). Our study taxon is the Aoraki denticulata species complex, a lineage of extremely low-dispersal arachnids endemic to the South Island of Aotearoa New Zealand. This group of mite harvesters has been the subject of previous species delimitation studies using smaller datasets generated through Sanger sequencing and analytical approaches that rely on multispecies coalescent models and barcoding gap discovery. Those analyses yielded a number of putative cryptic species that seems unrealistic and extreme, based on what we know about species' geographic ranges and genetic diversity in non-cryptic mite harvesters. We find that machine learning approaches, on the other hand, identify cryptic species with geographic ranges that are similar to those seen in other morphologically diagnosable mite harvesters in Aotearoa New Zealand's South Island. We performed both unsupervised and supervised machine learning analyses, the latter with training data drawn either from animals broadly (vagile and non-vagile) or from a custom training dataset from dispersal-limited harvesters. We conclude that applying machine learning approaches to the analysis of UCE-derived genetic data is an effective method for delimiting species in complexes of low-vagility cryptic species, and that the incorporation of training data from biologically relevant analogues can be critically informative.

Keywords: Aotearoa; Cyphophthalmi; New Zealand; Opiliones.

MeSH terms

  • Animals
  • Arachnida*
  • Machine Learning
  • New Zealand
  • Phylogeny
  • Spiders*