NGS read classification using AI

PLoS One. 2021 Dec 22;16(12):e0261548. doi: 10.1371/journal.pone.0261548. eCollection 2021.

Abstract

Clinical metagenomics is a powerful diagnostic tool, as it offers an open view into all DNA in a patient's sample. This allows the detection of pathogens that would slip through the cracks of classical specific assays. However, because metagenomic sequencing is untargeted, the sequencing itself generates a large amount of non-specific data, and the diagnosis takes place only at the data analysis stage, where the relevant sequences are filtered from the rest. Typically, this is done by comparison to reference databases. While this approach has been optimized over the past years and works well for pathogens that are represented in the databases used, a common challenge in analysing a metagenomic patient sample arises when no pathogen sequences are found: how can one determine whether the data truly contain no evidence of a pathogen, or whether the pathogen's genome is simply absent from the database, so that its sequences could not be classified? Here, we present a novel approach to this problem of detecting novel pathogens in metagenomic datasets by classifying the proteins, or protein segments, encoded by the sequences in the datasets. We train a neural network on coding sequences labeled by taxonomic domain and use it to predict the taxonomic classification of sequences that cannot be classified by comparison to a reference database, thus facilitating the detection of potential novel pathogens.
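The abstract does not describe how reads are turned into protein segments, so the following is only a minimal sketch of one plausible preprocessing step, not the authors' pipeline. It assumes the standard genetic code and a six-frame translation of each read, with stop codons used to split the translation into segments. The function names and the `min_len` threshold are hypothetical and chosen for illustration.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons; codons with ambiguous bases become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")
        for i in range(0, len(seq) - 2, 3)
    )


def six_frame_segments(read, min_len=10):
    """Collect stop-free protein segments from all six reading frames.

    min_len is a hypothetical cut-off, not a value taken from the paper.
    """
    read = read.upper()
    segments = []
    for strand in (read, reverse_complement(read)):
        for offset in range(3):
            protein = translate(strand[offset:])
            # Stop codons ('*') delimit candidate protein segments.
            segments.extend(s for s in protein.split("*") if len(s) >= min_len)
    return segments
```

Segments produced this way could then be encoded numerically (for example, one-hot per amino acid) and passed to a classifier that predicts the taxonomic domain; that encoding and the network architecture are likewise not specified in the abstract.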

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Animals
  • Bacteria / classification
  • Bacteria / genetics
  • DNA / classification
  • DNA / genetics
  • DNA, Bacterial / classification
  • DNA, Bacterial / genetics
  • DNA, Viral / classification
  • DNA, Viral / genetics
  • High-Throughput Nucleotide Sequencing / methods*
  • Humans
  • Metagenome
  • Metagenomics / methods*
  • Neural Networks, Computer*
  • Viruses / classification
  • Viruses / genetics

Substances

  • DNA, Bacterial
  • DNA, Viral
  • DNA

Grants and funding

BV and OF are funded by the Federal Ministry of Education and Research of Germany (BMBF, https://www.bmbf.de/) in the project deep.Health (project number 13FH770IX6). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.