GPApred: The first computational predictor for identifying proteins with LPXTG-like motif using sequence-based optimal features

Int J Biol Macromol. 2023 Feb 28:229:529-538. doi: 10.1016/j.ijbiomac.2022.12.315. Epub 2022 Dec 31.

Abstract

The cell surface proteins of gram-positive bacteria are involved in many important biological functions, including the infection of host cells. Owing to their role in virulence, these proteins are also considered strong candidates for drug or vaccine targets. Among the various cell surface proteins of gram-positive bacteria, LPXTG-like proteins form a major class. These proteins have a highly conserved C-terminal cell wall sorting signal, which consists of an LPXTG sequence motif, a hydrophobic domain, and a positively charged tail. These surface proteins are targeted to the cell envelope by a sortase enzyme via transpeptidation. A variety of LPXTG-like proteins have been experimentally characterized; however, extensive bacterial genome sequencing has increased the number of such sequences in public databases, and many of them lack proper annotation. In the absence of experimental characterization, identifying and annotating these sequences is extremely challenging. Therefore, in this study, we developed the first machine learning-based predictor, called GPApred, which can identify LPXTG-like proteins from their primary sequences. Using a newly constructed benchmark dataset, we explored different classifiers, five feature encodings, and hybrids of those encodings. Optimal features were derived using the recursive feature elimination method, and a support vector machine was then trained on these features. The performance of different models was evaluated using independent datasets, and a final model (GPApred) was selected based on its consistency between cross-validation and independent assessment. GPApred can be an effective tool for predicting LPXTG-like sequences and can be further employed for functional characterization or drug targeting. Availability: https://procarb.org/gpapred/.
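
To make the three-part sorting signal described above concrete (LPXTG motif, hydrophobic domain, positively charged tail), the following is a minimal rule-based sketch in Python. It is not the GPApred method, which is a trained support vector machine. The function name `has_sorting_signal`, the hydrophobic alphabet, the window size, and all thresholds are illustrative assumptions and are not taken from the paper.

```python
import re

# All constants below are illustrative assumptions, not values from the paper.
MOTIF = re.compile(r"LP.TG")      # LPXTG, where X is any residue
HYDROPHOBIC = set("AVLIMFWCG")    # assumed hydrophobic alphabet
POSITIVE = set("KR")              # positively charged residues


def has_sorting_signal(seq, window=50, tail_len=8,
                       min_hydrophobic_frac=0.6, min_positive=2):
    """Heuristically check for an LPXTG-like C-terminal sorting signal.

    Looks in the last `window` residues for an LPXTG motif, then requires
    a mostly hydrophobic stretch followed by a positively charged tail.
    """
    c_term = seq.upper()[-window:]
    matches = list(MOTIF.finditer(c_term))
    if not matches:
        return False

    # Use the motif occurrence closest to the C-terminus.
    rest = c_term[matches[-1].end():]
    if len(rest) <= tail_len:
        return False  # nothing left to form a hydrophobic domain

    hydrophobic_region = rest[:-tail_len]
    tail = rest[-tail_len:]

    frac = sum(aa in HYDROPHOBIC for aa in hydrophobic_region) / len(hydrophobic_region)
    n_positive = sum(aa in POSITIVE for aa in tail)
    return frac >= min_hydrophobic_frac and n_positive >= min_positive


# Synthetic sequences for illustration only (not real proteins).
positive_example = "MKT" + "S" * 40 + "LPETG" + "AVLLAGIAVLLAGIALV" + "KRRKSEEN"
negative_example = "MKT" + "S" * 80
```

A fixed rule of this kind misses motif variants and atypical signals, which is one reason a learned, sequence-based predictor can be useful for unannotated sequences.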

Keywords: Cell wall sorting signal; Feature selection; Machine learning; Sortase; Support vector machine; Surface proteins.

MeSH terms

  • Aminoacyltransferases* / metabolism
  • Bacterial Proteins* / chemistry
  • Base Sequence
  • Cysteine Endopeptidases / metabolism
  • Membrane Proteins / metabolism

Substances

  • Bacterial Proteins
  • Aminoacyltransferases
  • Cysteine Endopeptidases
  • Membrane Proteins