Classification of bacterial plasmid and chromosome derived sequences using machine learning

PLoS One. 2022 Dec 16;17(12):e0279280. doi: 10.1371/journal.pone.0279280. eCollection 2022.

Abstract

Plasmids are important genetic elements that facilitate horizontal gene transfer between bacteria and contribute to the spread of virulence and antimicrobial resistance. Most bacterial genome sequences in the public archives exist in draft form with many contigs, making it difficult to determine whether a contig is of chromosomal or plasmid origin. Using a training set of contigs comprising 10,584 chromosomes and 10,654 plasmids from the PATRIC database, we evaluated several machine learning models, including random forest, logistic regression, XGBoost, and a neural network, for their ability to classify chromosomal and plasmid sequences using nucleotide k-mers as features. Of the methods tested, a neural network model that used nucleotide 6-mers as features and was trained on randomly selected chromosomal and plasmid subsequences 5 kb in length achieved the best performance, outperforming existing out-of-the-box methods, with an average accuracy of 89.38% ± 2.16% over a 10-fold cross-validation. The model accuracy can be improved to 92.08% by using a voting strategy when classifying holdout sequences. In both plasmids and chromosomes, subsequences encoding functions involved in horizontal gene transfer (including hypothetical proteins, transporters, phage, mobile elements, and CRISPR elements) were most likely to be misclassified by the model. This study provides a straightforward approach for identifying plasmid-derived sequences in short-read assemblies without the need for sequence alignment-based tools.
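The featurization and voting steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the frequency normalization, the number of windows, and the assumption that votes are cast by several random 5 kb subsequences of one contig are all hypothetical choices made for illustration, and `classify` stands in for any trained model.

```python
import random
from collections import Counter
from itertools import product

K = 6  # 6-mers, as in the abstract
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]
KMER_INDEX = {kmer: i for i, kmer in enumerate(KMERS)}  # 4**6 = 4096 features


def kmer_frequencies(seq):
    """Return a 4096-long vector of 6-mer frequencies for one sequence.

    Normalizing counts to frequencies is an assumption of this sketch.
    Windows containing characters other than A, C, G, T are skipped.
    """
    seq = seq.upper()
    counts = [0] * len(KMERS)
    total = 0
    for i in range(len(seq) - K + 1):
        idx = KMER_INDEX.get(seq[i:i + K])
        if idx is not None:
            counts[idx] += 1
            total += 1
    if total == 0:
        return [0.0] * len(KMERS)
    return [c / total for c in counts]


def sample_windows(seq, length=5000, n=11, seed=0):
    """Draw n random subsequences of the given length (hypothetical helper).

    Sequences shorter than `length` are returned whole as a single window.
    """
    if len(seq) <= length:
        return [seq]
    rng = random.Random(seed)
    starts = [rng.randrange(len(seq) - length + 1) for _ in range(n)]
    return [seq[s:s + length] for s in starts]


def vote(contig, classify, length=5000, n=11, seed=0):
    """Label a contig by majority vote over its sampled windows.

    `classify` is any callable mapping a feature vector to a label such as
    "plasmid" or "chromosome"; an odd n avoids ties between two labels.
    """
    labels = [classify(kmer_frequencies(w))
              for w in sample_windows(contig, length, n, seed)]
    return Counter(labels).most_common(1)[0][0]
```

In practice `classify` would wrap the prediction call of a trained model; here it is left as a plain callable so the sketch does not depend on any particular machine learning library.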

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Bacteria / genetics
  • Chromosomes, Bacterial* / genetics
  • Genome, Bacterial*
  • Machine Learning
  • Nucleotides
  • Plasmids / genetics

Substances

  • Nucleotides

Grants and funding

This work was supported by the Natural Science Foundation of China (81900009) and the Innovation Team and Talents Cultivation Program of the National Administration of Traditional Chinese Medicine (No. ZYYCXTD-D-202208). This work was also funded in part by the United States National Institute of Allergy and Infectious Diseases Bacterial and Viral Bioinformatics Resource Center (BRC) award [Contract No. 75N93019C00076] to PI Rick Stevens. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Funding by author: James Davis, Marcus Nguyen, and Jamie Overbeek: BRC award [Contract No. 75N93019C00076]. Xiaohui Zou and Bin Cao: Natural Science Foundation of China (81900009). Xiaohui Zou: Innovation Team and Talents Cultivation Program of the National Administration of Traditional Chinese Medicine (No. ZYYCXTD-D-202208).