Predicting bacterial virulence factors - evaluation of machine learning and negative data strategies

Brief Bioinform. 2020 Sep 25;21(5):1596-1608. doi: 10.1093/bib/bbz076.

Abstract

Bacterial proteins dubbed virulence factors (VFs) are a highly diverse group of sequences whose only obvious commonality is the very property of being, more or less directly, involved in virulence. It is therefore tempting to ask whether their prediction, based on direct sequence similarity (seqsim) to known VFs, could be enhanced or even replaced by machine-learning methods. Specifically, when trained on a large and diverse set of VFs, such methods may be able to detect putative, non-trivial characteristics shared by otherwise unrelated VF families and therefore better predict novel VFs with insignificant similarity to each individual family. We therefore first reassess the performance of dimer-based Support Vector Machines, as employed in the widely used MP3 method, in light of seqsim-only and seqsim/dimer-hybrid classifiers. We then repeat the analysis with a novel, considerably more diverse data set, also addressing the important problem of negative data selection. Finally, we move on to the real-world use case of proteome-wide VF prediction, outlining different approaches to estimating specificity in this scenario. We find that direct seqsim is of unparalleled importance and should therefore always be exploited. Further, we observe strikingly low correlations between different feature and classifier types when ranking proteins by VF likeness. We therefore propose a 'best of each world' approach to prioritizing proteins for experimental testing, focussing on the top predictions of each classifier. In addition, classifiers for individual VF families should be developed.
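
The two ideas named in the abstract, dimer-based features and the 'best of each world' selection, can be illustrated with a minimal Python sketch. It rests on assumptions the abstract does not spell out: "dimer" is read here as dipeptide (adjacent amino acid pair) frequency, and 'best of each world' is read as the union of the top-ranked proteins from each classifier. The names `dimer_composition`, `best_of_each_world`, `top_k` and the classifier labels are hypothetical and do not come from the paper or from MP3.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs of the 20 standard amino acids.
DIMERS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
DIMER_INDEX = {d: i for i, d in enumerate(DIMERS)}


def dimer_composition(sequence):
    """Return a 400-dimensional dipeptide frequency vector (assumed encoding)."""
    counts = [0] * len(DIMERS)
    total = 0
    seq = sequence.upper()
    for i in range(len(seq) - 1):
        idx = DIMER_INDEX.get(seq[i:i + 2])
        if idx is not None:  # skip pairs containing non-standard residues
            counts[idx] += 1
            total += 1
    if total == 0:
        return [0.0] * len(DIMERS)
    return [c / total for c in counts]


def best_of_each_world(scores_by_classifier, top_k):
    """Union of the top_k highest-scoring proteins of each classifier.

    scores_by_classifier maps a classifier label to a dict of
    protein id -> VF-likeness score (higher means more VF-like).
    """
    selected = set()
    for scores in scores_by_classifier.values():
        ranked = sorted(scores, key=scores.get, reverse=True)
        selected.update(ranked[:top_k])
    return selected
```

Such a feature vector could be passed to any SVM implementation. Taking the union rather than an averaged rank keeps each classifier's strongest candidates, which is one plausible reading of the proposal given the low correlations between rankings that the abstract reports.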

Keywords: benchmark; machine learning; negative data; prediction; sequence similarity; virulence factors.

Publication types

  • Research Support, Non-U.S. Gov't
  • Review

MeSH terms

  • Algorithms
  • Amino Acid Sequence
  • Bacteria / pathogenicity*
  • Bacterial Proteins / chemistry
  • Bacterial Proteins / metabolism*
  • Datasets as Topic
  • Dimerization
  • Proteome
  • Support Vector Machine*
  • Virulence Factors / chemistry
  • Virulence Factors / metabolism*

Substances

  • Bacterial Proteins
  • Proteome
  • Virulence Factors