Sa-TTCA: An SVM-based approach for tumor T-cell antigen classification using features extracted from biological sequencing and natural language processing

Comput Biol Med. 2024 May:174:108408. doi: 10.1016/j.compbiomed.2024.108408. Epub 2024 Apr 4.

Abstract

Accurately predicting tumor T-cell antigen (TTCA) sequences is a crucial task in the development of cancer vaccines and immunotherapies. TTCAs, which are derived from tumor cells, are presented to immune cells (T cells) through the major histocompatibility complex (MHC), via the recognition of specific portions of their structure known as epitopes. More specifically, MHC class I presents TTCAs to T-cell receptors (TCRs), which are located on the surface of CD8+ T cells. However, TTCA sequences are highly varied, which complicates vaccine design. Recently, machine learning (ML) models have been developed to predict TTCA sequences, which could aid fast and accurate TTCA identification. In constructing a TTCA predictor, the peptide encoding strategy is an important step. Previous studies have used biological descriptors to encode TTCA sequences. However, no studies have used natural language processing (NLP), a promising approach for this purpose. Just as sentences are made of words with diverse properties, biological sequences hold unique characteristics that reflect evolutionary information, physicochemical values, and structural information. We hypothesized that NLP methods would benefit TTCA prediction. To develop a new TTCA identification model, we first constructed a baseline model using widely used ML algorithms and features extracted from biological descriptors. Then, to improve model performance, we added features extracted from biological language models (BLMs) based on NLP methods. In addition, we conducted feature selection using the Chi-square and Pearson Correlation Coefficient techniques. SMOTE, Up-sampling, and Near-Miss were then used to handle the imbalanced data. Finally, we optimized Sa-TTCA by applying the SVM algorithm to the four most effective feature groups. The best-performing Sa-TTCA showed a competitive balanced accuracy of 87.5% on a training set and 72.0% on an independent testing set. Our results suggest that integrating biological descriptors with natural language processing has the potential to improve the precision of predicting protein/peptide functionality, which could be beneficial for developing cancer vaccines.
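The two encoding views and the reported metric can be illustrated with a minimal Python sketch. It is not the Sa-TTCA implementation: the abstract does not name the specific biological descriptors or BLMs used, so amino acid composition stands in for a descriptor-style encoding and overlapping k-mers stand in for NLP-style "words". The function names, the example peptide string, and the toy labels are all hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(peptide):
    """Descriptor-style encoding (assumed example): fraction of each residue."""
    if not peptide:
        raise ValueError("peptide must be non-empty")
    counts = Counter(peptide)
    return [counts[aa] / len(peptide) for aa in AMINO_ACIDS]


def kmer_tokens(peptide, k=3):
    """NLP-style view (assumed example): overlapping k-mers treated as words."""
    return [peptide[i:i + k] for i in range(len(peptide) - k + 1)]


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels (1 = TTCA)."""
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present in y_true")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return 0.5 * (tp / pos + tn / neg)


# Hypothetical peptide string and toy labels, for illustration only.
example_peptide = "SIINFEKL"
composition = amino_acid_composition(example_peptide)
tokens = kmer_tokens(example_peptide, k=3)
score = balanced_accuracy([1, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 1])
```

Balanced accuracy averages the per-class recall, so it is not inflated by the majority class the way plain accuracy is. That makes it a natural metric for imbalanced data of the kind the resampling steps above are meant to address.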

Keywords: Biological language models; Cancer vaccines; Machine learning; Peptide sequences; Protein encoding.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Antigens, Neoplasm* / chemistry
  • Antigens, Neoplasm* / genetics
  • Antigens, Neoplasm* / immunology
  • Computational Biology / methods
  • Humans
  • Natural Language Processing*
  • Neoplasms / immunology
  • Sequence Analysis, Protein / methods
  • Support Vector Machine*

Substances

  • Antigens, Neoplasm