A demonstration of unsupervised machine learning in species delimitation

Shahan Derkarabetian; Stephanie Castillo; Peter K Koo; Sergey Ovchinnikov; Marshal Hedin

doi:10.1016/j.ympev.2019.106562

Mol Phylogenet Evol. 2019 Oct:139:106562. doi: 10.1016/j.ympev.2019.106562. Epub 2019 Jul 16.

Authors

Shahan Derkarabetian¹, Stephanie Castillo², Peter K Koo³, Sergey Ovchinnikov⁴, Marshal Hedin⁵

Affiliations

¹ Department of Organismic and Evolutionary Biology, Museum of Comparative Zoology, Harvard University, Cambridge, MA 02138, United States; Department of Biology, San Diego State University, San Diego, CA 92182, United States; Department of Evolution, Ecology, and Organismal Biology, University of California, Riverside, Riverside, CA 92521, United States. Electronic address: sderkarabetian@gmail.com.
² Department of Biology, San Diego State University, San Diego, CA 92182, United States; Department of Entomology, University of California, Riverside, Riverside, CA 92521, United States.
³ Howard Hughes Medical Institute, Department of Molecular and Cellular Biology, Harvard University, Cambridge, MA 02138, United States.
⁴ Center for Systems Biology, Harvard University, Cambridge, MA 02138, United States.
⁵ Department of Biology, San Diego State University, San Diego, CA 92182, United States.

Abstract

One major challenge to delimiting species with genetic data is successfully differentiating population structure from species-level divergence, an issue exacerbated in taxa inhabiting naturally fragmented habitats. Many fields of science are now using machine learning, and in evolutionary biology supervised machine learning has recently been used to infer species boundaries. These supervised methods require training data with associated labels. Conversely, unsupervised machine learning (UML) uses inherent data structure and does not require user-specified training labels, potentially providing more objectivity in species delimitation. In the context of integrative taxonomy, we demonstrate the utility of three UML approaches (random forests, variational autoencoders, t-distributed stochastic neighbor embedding) for species delimitation in an arachnid taxon with high population genetic structure (Opiliones, Laniatores, Metanonychus). We find that UML approaches successfully cluster samples according to species-level divergences and not high levels of population structure, while model-based validation methods severely over-split putative species. UML offers intuitive data visualization in two-dimensional space and the ability to accommodate various data types, and it has potential in many areas of systematic and evolutionary biology. We argue that machine learning methods are ideally suited for species delimitation and may perform well in many natural systems and across taxa with diverse biological characteristics.
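
The core idea of clustering without training labels can be illustrated with a minimal sketch. This is not the authors' pipeline: it swaps in plain k-means as a stand-in for the three UML approaches named above, and it runs on hypothetical toy data rather than the Metanonychus dataset. The toy genotypes (coded 0/1/2) assume two species that differ at many loci, each split into two populations that differ at only a few loci. All function names, locus counts, and the noise rate are assumptions made for illustration.

```python
import random


def simulate_genotypes(n_per_pop=5, n_loci=200, seed=0):
    """Hypothetical toy data: 2 species x 2 populations, genotypes coded 0/1/2."""
    rng = random.Random(seed)
    genotypes, species = [], []
    for sp in (0, 1):
        for pop in (0, 1):
            for _ in range(n_per_pop):
                row = []
                for locus in range(n_loci):
                    if locus < 150:
                        allele = 2 * sp    # species-level difference (many loci)
                    elif locus < 170:
                        allele = 2 * pop   # population structure (few loci)
                    else:
                        allele = 1         # uninformative locus
                    if rng.random() < 0.05:
                        allele = rng.choice((0, 1, 2))  # random genotype noise
                    row.append(allele)
                genotypes.append(row)
                species.append(sp)  # kept only for checking; never used to cluster
    return genotypes, species


def sq_dist(a, b):
    """Squared Euclidean distance between two genotype vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_two_clusters(rows, n_iter=20):
    """Plain k-means with k=2 and a deterministic farthest-point start."""
    first = rows[0]
    second = max(rows, key=lambda r: sq_dist(r, first))
    centroids = [list(first), list(second)]
    labels = []
    for _ in range(n_iter):
        # Assign each sample to its nearest centroid.
        labels = [
            0 if sq_dist(r, centroids[0]) <= sq_dist(r, centroids[1]) else 1
            for r in rows
        ]
        # Recompute each centroid as the per-locus mean of its members.
        for k in (0, 1):
            members = [r for r, lab in zip(rows, labels) if lab == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels


genotypes, species = simulate_genotypes()
labels = kmeans_two_clusters(genotypes)
print(labels)
```

In this toy setting the two clusters follow the species split rather than the population split, because the species-level signal spans far more loci. Real data would need a method for choosing the number of clusters, which this sketch fixes at two.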

Keywords: Integrative taxonomy; Opiliones; Random forest; Ultraconserved elements; Variational autoencoders; t-SNE.

Publication types

Research Support, N.I.H., Extramural
Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

Animals
Arachnida / classification
Arachnida / genetics
Cluster Analysis
Phylogeny
Polymorphism, Single Nucleotide
Principal Component Analysis
Unsupervised Machine Learning*

Grants and funding

DP5 OD026389/OD/NIH HHS/United States