Prediction of linear B-cell epitopes based on protein sequence features and BERT embeddings

Sci Rep. 2024 Jan 30;14(1):2464. doi: 10.1038/s41598-024-53028-w.

Abstract

Linear B-cell epitopes (BCEs) play a key role in the development of peptide vaccines and immunodiagnostic reagents. Accurate identification of linear BCEs is therefore of great importance for the prevention of infectious diseases and the diagnosis of related diseases. The experimental methods used to identify BCEs are both expensive and time-consuming, and they cannot meet the demand for identification across large-scale protein sequence data. As a result, there is a need for an efficient and accurate computational method to rapidly identify linear BCE sequences. In this work, we developed a new linear BCE prediction method, LBCE-BERT. The method combines peptide sequence information with embeddings from the natural language model BERT and uses an XGBoost classifier. The model was trained on three benchmark datasets for hyperparameter selection and was subsequently evaluated on several test datasets. The results indicate that our proposed method outperforms other methods in terms of AUROC and accuracy. The LBCE-BERT model is publicly available at: https://github.com/Lfang111/LBCE-BERT .
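
As a rough illustration of how sequence-derived features and embedding features can be joined into a single input for a classifier, the sketch below concatenates an amino acid composition vector with a precomputed embedding. It is not the LBCE-BERT implementation: the abstract does not specify which sequence features are used, so amino acid composition is an assumption, the function names are hypothetical, and the three-value embedding is a placeholder standing in for a real BERT output.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order so feature positions are stable.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(peptide):
    """Return the fraction of each standard residue in the peptide.

    Assumption: composition is used here only as an example of a
    sequence-derived feature; the abstract does not name the features.
    Non-standard residues count toward the length but get no feature slot.
    """
    peptide = peptide.upper()
    if not peptide:
        raise ValueError("peptide must not be empty")
    counts = Counter(peptide)
    return [counts[aa] / len(peptide) for aa in AMINO_ACIDS]


def build_feature_vector(peptide, embedding):
    """Concatenate sequence features with a precomputed embedding (hypothetical helper)."""
    return amino_acid_composition(peptide) + list(embedding)


# Placeholder embedding; a real one would come from a BERT-style protein model.
features = build_feature_vector("ACDA", [0.1, -0.2, 0.3])
```

A vector built this way for each peptide could then be passed, with epitope/non-epitope labels, to a gradient-boosted tree classifier such as XGBoost.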

MeSH terms

  • Amino Acid Sequence
  • Epitopes, B-Lymphocyte*
  • Proteins* / metabolism

Substances

  • Epitopes, B-Lymphocyte
  • Proteins