Applying convolutional neural networks to speed up environmental DNA annotation in a highly diverse ecosystem

Sci Rep. 2022 Jun 17;12(1):10247. doi: 10.1038/s41598-022-13412-w.

Abstract

High-throughput DNA sequencing is an increasingly important tool for monitoring and better understanding biodiversity responses to environmental changes in a standardized and reproducible way. Environmental DNA (eDNA) from organisms can be captured in ecosystem samples and sequenced using metabarcoding, but processing large volumes of eDNA data and annotating sequences to recognized taxa remains computationally expensive. Speed and accuracy are two major bottlenecks in this critical step. Here, we evaluated the ability of convolutional neural networks (CNNs) to process short eDNA sequences and associate them with taxonomic labels. Using a unique eDNA data set collected in highly diverse tropical South America, we compared the speed and accuracy of CNNs with those of a well-known bioinformatic pipeline (OBITools) in processing a small region (60 bp) of the 12S ribosomal DNA targeting freshwater fishes. We found that the taxonomic labels from the CNNs were comparable to those from OBITools, with high correlation levels for the composition of the regional fish fauna. The CNNs enabled the processing of raw fastq files at a rate of approximately 1 million sequences per minute, about 150 times faster than OBITools. Given the good performance of CNNs in the highly diverse ecosystem considered here, the development of more elaborate CNNs promises fast deployment for future biodiversity inventories using eDNA.
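
As an illustration of how a short eDNA read might be presented to a CNN, the sketch below one-hot encodes a sequence into a fixed 60 x 4 matrix, matching the 60 bp marker length mentioned above. The abstract does not describe the authors' actual input encoding, so the function name `one_hot_encode`, the zero-padding of short reads, and the all-zero treatment of ambiguous bases are assumptions made purely for illustration.

```python
# Hypothetical helper; not taken from the paper or from OBITools.
BASES = "ACGT"  # column order of the one-hot matrix (an assumption)


def one_hot_encode(seq, length=60):
    """Return a `length` x 4 matrix (list of lists) for a DNA sequence.

    Reads longer than `length` are truncated, shorter reads are padded.
    Padding positions and bases outside A/C/G/T (e.g. N) are all-zero rows.
    """
    seq = seq.upper()[:length]
    matrix = []
    for i in range(length):
        row = [0.0] * len(BASES)
        if i < len(seq):
            idx = BASES.find(seq[i])  # -1 when the base is ambiguous
            if idx >= 0:
                row[idx] = 1.0
        matrix.append(row)
    return matrix


encoded = one_hot_encode("ACGTN")
print(len(encoded), len(encoded[0]))  # matrix dimensions
print(encoded[:5])                    # rows for A, C, G, T, N
```

A fixed-size numeric matrix of this kind is what lets convolutional filters slide along the sequence axis; a real pipeline would batch many such matrices and would need its own policy for read orientation and quality filtering.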

MeSH terms

  • Animals
  • Biodiversity
  • DNA Barcoding, Taxonomic
  • DNA, Environmental* / genetics
  • Ecosystem*
  • Environmental Monitoring
  • Fishes / genetics
  • Neural Networks, Computer

Substances

  • DNA, Environmental