mtx-COBRA: Subcellular localization prediction for bacterial proteins

Comput Biol Med. 2024 Mar:171:108114. doi: 10.1016/j.compbiomed.2024.108114. Epub 2024 Feb 10.

Abstract

Background: Bacteria can have beneficial effects on our health and environment; however, many are responsible for serious infectious diseases, creating a need for vaccines against such pathogens. Bioinformatic and experimental technologies are crucial for the development of vaccines. The vaccine design pipeline requires identification of bacteria-specific antigens that the immune system can recognize and respond to upon infection. Immune system recognition is influenced by the location of a protein. Methods have been developed to determine the subcellular localization (SCL) of proteins in prokaryotes and eukaryotes. Bioinformatic tools such as PSORTb can be employed to determine the SCL of proteins, a task that would be tedious to perform experimentally. Unfortunately, PSORTb often predicts many proteins as having an "Unknown" SCL, reducing the number of antigens to evaluate as potential vaccine targets.

Method: We present a new pipeline called subCellular lOcalization prediction for BacteRiAl Proteins (mtx-COBRA). mtx-COBRA uses Meta's protein language model, Evolutionary Scale Modeling, combined with an Extreme Gradient Boosting machine learning model to identify SCL of bacterial proteins based on amino acid sequence. This pipeline is trained on a curated dataset that combines data from UniProt and the publicly available ePSORTdb dataset.
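The two-stage shape described above (sequence to fixed-length vector, then vector to SCL label) can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation. It assumes two loudly labeled stand-ins: amino acid composition takes the place of the protein language model embedding, and a nearest-centroid rule takes the place of the Extreme Gradient Boosting classifier. The function names, toy sequences, and labels are all hypothetical.

```python
from collections import Counter
from math import dist

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def embed(sequence):
    """STAND-IN for a protein language model embedding (assumption):
    a 20-dimensional amino acid composition vector.
    Expects a non-empty sequence; non-standard letters are ignored."""
    counts = Counter(sequence)
    n = len(sequence)
    return [counts[a] / n for a in AMINO_ACIDS]


def fit_centroids(sequences, labels):
    """STAND-IN for the gradient-boosted classifier (assumption):
    store the mean embedding of each label."""
    grouped = {}
    for seq, label in zip(sequences, labels):
        grouped.setdefault(label, []).append(embed(seq))
    return {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def predict(centroids, sequence):
    """Assign the label whose centroid is closest in Euclidean distance."""
    vector = embed(sequence)
    return min(centroids, key=lambda label: dist(vector, centroids[label]))


# Hypothetical toy sequences and labels, invented purely for illustration.
train_seqs = ["MKKLLLAAAVVLLA", "MLLAVVAALLKKLA", "MDEEKRDEEKRDDE", "MEEDDKRKEEDDRK"]
train_labels = ["membrane", "membrane", "cytoplasmic", "cytoplasmic"]

model = fit_centroids(train_seqs, train_labels)
prediction = predict(model, "MKLLAAVVLLAAKL")
```

The point of the sketch is the interface: the classifier only ever sees fixed-length vectors, so the embedding step and the classification step can each be swapped out independently.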

Results: Using benchmarking analyses, nested 5-fold cross-validation, and leave-one-pathogen-out methods, followed by testing on the held-out dataset, we show that our pipeline predicts the SCL of bacterial proteins more accurately than PSORTb.
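The two resampling schemes named above can be sketched as plain index splits. This is an illustrative sketch only, not the evaluation code used in the study: the function names are hypothetical, the inner loop is assumed to use 5 folds as well (the abstract does not state this), and no shuffling or class stratification is applied.

```python
def leave_one_group_out(groups):
    """Yield (held_out, train_idx, test_idx), one split per distinct group.
    Here a group would be the source pathogen of each protein."""
    for held_out in sorted(set(groups)):
        test_idx = [i for i, g in enumerate(groups) if g == held_out]
        train_idx = [i for i, g in enumerate(groups) if g != held_out]
        yield held_out, train_idx, test_idx


def k_fold(indices, k):
    """Yield (train, validation) index lists.
    Fold j takes every k-th item starting at position j (no shuffling)."""
    for j in range(k):
        validation = indices[j::k]
        held = set(validation)
        train = [i for i in indices if i not in held]
        yield train, validation


def nested_k_fold(n_samples, k_outer=5, k_inner=5):
    """Yield (outer_train, outer_test, inner_splits).
    Inner splits are drawn only from outer_train, so model selection
    never sees the outer test fold. k_inner=5 is an assumption."""
    all_indices = list(range(n_samples))
    for outer_train, outer_test in k_fold(all_indices, k_outer):
        inner_splits = list(k_fold(outer_train, k_inner))
        yield outer_train, outer_test, inner_splits


# Hypothetical pathogen labels for six proteins, for illustration only.
example_groups = ["A", "A", "B", "C", "C", "C"]
lopo_splits = list(leave_one_group_out(example_groups))
nested_splits = list(nested_k_fold(25))
```

In the leave-one-pathogen-out case, every protein from one pathogen is withheld together, which tests how well a model generalizes to an organism it has not seen during training.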

Conclusions: mtx-COBRA provides an accessible pipeline that can more efficiently classify bacterial proteins with currently "Unknown" SCLs than existing bioinformatic and experimental methods.

Keywords: Bacterial subcellular localization; Machine learning; Protein language model; Reverse vaccinology; mtx-COBRA.

MeSH terms

  • Amino Acid Sequence
  • Bacteria
  • Bacterial Proteins* / chemistry
  • Computational Biology / methods
  • Software
  • Vaccines*

Substances

  • Bacterial Proteins
  • Vaccines