plASgraph2: using graph neural networks to detect plasmid contigs from an assembly graph

Front Microbiol. 2023 Oct 6:14:1267695. doi: 10.3389/fmicb.2023.1267695. eCollection 2023.

Abstract

Identification of plasmids from sequencing data is an important and challenging problem related to the spread of antimicrobial resistance and other One-Health issues. We provide a new architecture for identifying plasmid contigs in fragmented genome assemblies built from short-read data. We employ graph neural networks (GNNs) and the assembly graph to propagate information from nearby nodes, which leads to more accurate classification, especially for short contigs that are difficult to classify based on sequence features or database searches alone. We trained plASgraph2 on a data set of samples from the ESKAPEE group of pathogens. plASgraph2 either outperforms or performs on par with a wide range of state-of-the-art methods on testing sets of independent ESKAPEE samples and samples from related pathogens. On the one hand, our study provides a new, accurate, and easy-to-use tool for contig classification in bacterial isolates; on the other hand, it serves as a proof of concept for the use of GNNs in genomics. Our software is available at https://github.com/cchauve/plasgraph2 and the training and testing data sets are available at https://github.com/fmfi-compbio/plasgraph2-datasets.
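
To illustrate the idea of propagating information from nearby nodes of an assembly graph, the following minimal sketch averages each contig's score with those of its graph neighbours. It is not the plASgraph2 architecture: the function `propagate`, the contig names, the initial scores, and the edges are all hypothetical, and a real GNN layer would apply learned weights rather than a plain mean.

```python
def propagate(scores, edges, rounds=1):
    """Average each node's score with its neighbours' scores, `rounds` times.

    scores: dict mapping contig name -> initial plasmid score in [0, 1]
    edges:  iterable of (contig, contig) pairs from the assembly graph
    """
    # Build an undirected adjacency map over the contigs.
    neighbours = {node: set() for node in scores}
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)

    current = dict(scores)
    for _ in range(rounds):
        updated = {}
        for node, own in current.items():
            # Mean over the node itself and its direct neighbours.
            values = [own] + [current[n] for n in sorted(neighbours[node])]
            updated[node] = sum(values) / len(values)
        current = updated
    return current


# Hypothetical toy data: a short, ambiguous contig lying between two
# contigs that look plasmid-like, plus one unconnected contig.
initial_scores = {"long_a": 0.9, "short_b": 0.5, "long_c": 0.8, "chrom_d": 0.1}
assembly_edges = [("long_a", "short_b"), ("short_b", "long_c")]

smoothed = propagate(initial_scores, assembly_edges)
for contig, score in smoothed.items():
    print(contig, round(score, 3))
```

In this toy example the ambiguous short contig moves toward the scores of its neighbours, while the unconnected contig keeps its own score, which is the intuition behind using graph context for contigs that are hard to classify from sequence alone.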

Keywords: assembly graph; bioinformatics; classification; machine learning (ML); plasmids.

Grants and funding

The author(s) declare that financial support was received for the research, authorship, and/or publication of this article. This research was supported by the European Union Horizon 2020 Grant No. 872539 (PANGAIA). JS and KS were supported by the Bielefeld University Graduate School Digital Infrastructure for the Life Sciences (DILS) Grant. This research was also supported by grants 1/0463/20 (BB) and 1/0538/22 (TV) from the Scientific Grant Agency of the Ministry of Education, Science, Research, and Sport of the Slovak Republic and Slovak Academy of Sciences (VEGA), Grant APVV-22-0144 from the Slovak Research and Development Agency (BB and TV), and Discovery Grant RGPIN/03986-2017 from the Natural Sciences and Engineering Research Council of Canada (CC). This research was enabled in part by computational infrastructure support provided by the Digital Research Alliance of Canada (https://alliancecan.ca).