Identification of Phage Receptor-Binding Protein Sequences with Hidden Markov Models and an Extreme Gradient Boosting Classifier

Viruses. 2022 Jun 17;14(6):1329. doi: 10.3390/v14061329.

Abstract

Receptor-binding proteins (RBPs) of bacteriophages initiate the infection of their corresponding bacterial host and act as the primary determinant of host specificity. The ever-increasing amount of sequence data enables the development of predictive models for the automated identification of RBP sequences. However, the development of such models is challenged by the inconsistent or missing annotation of many phage proteins. Recently developed tools have started to bridge this gap but are not specifically focused on RBP sequences, for which many different annotations are in use. We have developed two parallel approaches to simplify the identification of RBP sequences in phage genomic data. The first combines known RBP-related hidden Markov models (HMMs) from the Pfam database with custom-built HMMs to identify phage RBPs based on protein domains. The second consists of training an extreme gradient boosting classifier that can accurately discriminate between RBPs and other phage proteins. We explain how these complementary approaches can reinforce each other in identifying RBP sequences. In addition, we benchmarked our methods against the recently developed PhANNs tool. Our best-performing model reached a precision-recall area under the curve of 93.8% and outperformed PhANNs on an independent test set, reaching an F1-score of 84.0% compared to 69.8%.
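
As an illustration only, the following minimal Python sketch shows one way the calls from a domain-based (HMM) approach and a classifier could be cross-checked, and how an F1-score such as the one reported above is computed from binary predictions. The function names, the 0.5 score threshold, the category labels, and the toy data are all assumptions made for this example; they are not taken from the article and do not reproduce its pipeline or results.

```python
def cross_check(has_domain_hit, classifier_score, threshold=0.5):
    """Label one protein by agreement between two hypothetical predictors.

    has_domain_hit: True if an RBP-related HMM matched (assumed input).
    classifier_score: probability-like score in [0, 1] (assumed input).
    threshold: assumed cut-off for calling the classifier positive.
    """
    classifier_positive = classifier_score >= threshold
    if has_domain_hit and classifier_positive:
        return "both"
    if has_domain_hit:
        return "hmm_only"
    if classifier_positive:
        return "classifier_only"
    return "neither"


def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for binary labels (1 = RBP)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy data, invented for this sketch (not from the article).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)

labels = [
    cross_check(True, 0.91),
    cross_check(True, 0.20),
    cross_check(False, 0.75),
    cross_check(False, 0.05),
]
```

In such a scheme, proteins labelled "both" would be the most confident calls, while "hmm_only" and "classifier_only" cases would be candidates for closer inspection.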

Keywords: extreme gradient boosting; hidden Markov models; machine learning; phage; receptor-binding protein.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Bacteriophage Receptors*
  • Bacteriophages* / genetics
  • Bacteriophages* / metabolism
  • Carrier Proteins / metabolism
  • Protein Binding
  • Proteins / metabolism

Substances

  • Bacteriophage Receptors
  • Carrier Proteins
  • Proteins

Grants and funding

D.B. is supported by the Research Foundation—Flanders (FWO), grant number 1S69520N. M.S. and B.D.B. received funding from the Flemish Government under the “Onderzoeksprogramma Artificiële Intelligentie (AI) Vlaanderen” program.