Application of the random forest algorithm to Streptococcus pyogenes response regulator allele variation: from machine learning to evolutionary models

Sci Rep. 2021 Jun 16;11(1):12687. doi: 10.1038/s41598-021-91941-6.

Abstract

Group A Streptococcus (GAS) is a globally significant bacterial pathogen. The GAS genotyping gold standard characterises the nucleotide variation of emm, which encodes a surface-exposed protein that is recombinogenic and under immune-based selection pressure. Using a supervised learning methodology, we tested whether three random forest (RF) algorithms (Guided, Ordinary, and Regularized) could infer six genomic traits (emm-type, emm-subtype, tissue and country of sample, clinical outcomes, and isolate invasiveness) from 53 GAS response regulator (RR) allele types. The Guided, Ordinary, and Regularized RF classifiers inferred the emm-type with accuracies of 96.7%, 95.7%, and 95.2%, using ten, three, and four RR alleles in the feature set, respectively. Notably, we inferred the emm-type with 93.7% accuracy using only mga2 and lrp. The approach was also useful for inferring emm-subtype (89.9%), country (88.6%), and invasiveness (84.7%), but not clinical outcome (56.9%) or tissue (56.4%), which is consistent with the complexity of GAS pathophysiology. We identified a novel cell wall-spanning domain (SF5), and proposed evolutionary pathways depicting the 'contrariwise' and 'likewise' chimeric deletion-fusion of emm and enn. We identified an intermediate strain, which provides evidence of the time-dependent excision of mga regulon genes. Overall, our workflow advances the understanding of the GAS mga regulon and its plasticity.
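
The supervised set-up described in the abstract (allele types as features, emm-type as the label, then a reduced feature set) can be illustrated with a minimal sketch. It is not the authors' pipeline: it uses only an ordinary random forest from scikit-learn (the Guided and Regularized variants are not part of that library), the data are synthetic, and every name in it (alleles, emm_type, fit_and_score) is hypothetical. The accuracies it produces have no relation to the figures reported above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT from the study): 600 isolates, 53 loci,
# each locus holding an integer allele identifier.
n_isolates, n_loci, n_emm_types = 600, 53, 5
emm_type = rng.integers(0, n_emm_types, size=n_isolates)
alleles = rng.integers(0, 8, size=(n_isolates, n_loci))
# Assumption for illustration: two loci track the label, the rest are noise.
alleles[:, 0] = emm_type
alleles[:, 1] = (emm_type + rng.integers(0, 2, size=n_isolates)) % n_emm_types


def fit_and_score(X, y, seed=0):
    """Fit an ordinary RF on one-hot encoded alleles.

    Returns the held-out accuracy and one importance value per locus.
    """
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed, stratify=y
    )
    # Allele identifiers are nominal, so encode each one as its own column.
    encoder = OneHotEncoder(handle_unknown="ignore")
    forest = RandomForestClassifier(n_estimators=200, random_state=seed)
    forest.fit(encoder.fit_transform(X_train), y_train)
    accuracy = forest.score(encoder.transform(X_test), y_test)

    # Sum the one-hot column importances back to a single value per locus.
    widths = [len(categories) for categories in encoder.categories_]
    edges = np.cumsum([0] + widths)
    per_locus = np.array(
        [forest.feature_importances_[a:b].sum()
         for a, b in zip(edges[:-1], edges[1:])]
    )
    return accuracy, per_locus


# Train on all loci, then retrain on the two highest-ranked loci only.
full_accuracy, importance = fit_and_score(alleles, emm_type)
top_two = np.argsort(importance)[::-1][:2]
reduced_accuracy, _ = fit_and_score(alleles[:, top_two], emm_type)

print("accuracy with all loci:", round(full_accuracy, 3))
print("top two loci by importance:", top_two.tolist())
print("accuracy with top two loci:", round(reduced_accuracy, 3))
```

One-hot encoding is used here because allele identifiers are labels rather than quantities; passing them to the forest as raw integers would impose an ordering that the alleles do not have.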

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms*
  • Alleles
  • Antigens, Bacterial / genetics
  • Bacterial Outer Membrane Proteins / genetics
  • Bacterial Proteins / genetics
  • Carrier Proteins / genetics
  • Evolution, Molecular*
  • Genes, Bacterial
  • Genetic Variation*
  • Genome, Bacterial
  • Humans
  • Machine Learning*
  • Regulon*
  • Streptococcal Infections / microbiology
  • Streptococcus pyogenes / classification
  • Streptococcus pyogenes / genetics*
  • Streptococcus pyogenes / pathogenicity

Substances

  • Antigens, Bacterial
  • Bacterial Outer Membrane Proteins
  • Bacterial Proteins
  • Carrier Proteins
  • streptococcal M protein