Sa-TTCA: An SVM-based approach for tumor T-cell antigen classification using features extracted from biological sequencing and natural language processing

Thi-Oanh Tran; Nguyen Quoc Khanh Le

doi:10.1016/j.compbiomed.2024.108408

Comput Biol Med. 2024 May:174:108408. doi: 10.1016/j.compbiomed.2024.108408. Epub 2024 Apr 4.

Authors

Thi-Oanh Tran¹, Nguyen Quoc Khanh Le²

Affiliations

¹ International Ph.D. Program in Cell Therapy and Regenerative Medicine, College of Medicine, Taipei Medical University, Taipei, 110, Taiwan; AIBioMed Research Group, Taipei Medical University, Taipei, 110, Taiwan; Hematology and Blood Transfusion Center, Bach Mai Hospital, No. 78, Giai Phong Street, Hanoi, Viet Nam.
² AIBioMed Research Group, Taipei Medical University, Taipei, 110, Taiwan; Professional Master Program in Artificial Intelligence in Medicine, College of Medicine, Taipei Medical University, Taipei, 110, Taiwan; Research Center for Artificial Intelligence in Medicine, Taipei Medical University, Taipei, 110, Taiwan; Translational Imaging Research Center, Taipei Medical University Hospital, Taipei, 110, Taiwan. Electronic address: khanhlee@tmu.edu.tw.

PMID: 38636332
DOI: 10.1016/j.compbiomed.2024.108408

Abstract

Accurately predicting tumor T-cell antigen (TTCA) sequences is a crucial task in the development of cancer vaccines and immunotherapies. TTCAs, which are derived from tumor cells, are presented to immune cells (T cells) through the major histocompatibility complex (MHC), and T cells recognize specific portions of their structure known as epitopes. More specifically, MHC class I presents TTCAs to T-cell receptors (TCRs), which are located on the surface of CD8+ T cells. However, TTCA sequences are highly varied, which complicates vaccine design. Recently, machine learning (ML) models have been developed to predict TTCA sequences, which could aid fast and accurate TTCA identification. In constructing a TTCA predictor, the peptide encoding strategy is an important step. Previous studies have used biological descriptors to encode TTCA sequences. However, no studies have used natural language processing (NLP), a promising approach for this purpose. Just as sentences are composed of words with diverse properties, biological sequences hold unique characteristics that reflect evolutionary information, physicochemical values, and structural information. We hypothesized that NLP methods would benefit the prediction of TTCAs. To develop a new TTCA identification model, we first constructed a baseline model with widely used ML algorithms and features extracted from biological descriptors. Then, to improve model performance, we added features extracted from biological language models (BLMs) based on NLP methods. In addition, we conducted feature selection using Chi-square and Pearson correlation coefficient techniques. Then, SMOTE, up-sampling, and Near-Miss were used to address the imbalanced data. Finally, we optimized Sa-TTCA by applying the SVM algorithm to the four most effective feature groups. The best-performing Sa-TTCA achieved a competitive balanced accuracy of 87.5% on the training set and 72.0% on an independent testing set. Our results suggest that integrating biological descriptors with natural language processing has the potential to improve the precision of predicting protein/peptide functionality, which could be beneficial for developing cancer vaccines.
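
As an illustration only, three of the steps named in the abstract (encoding peptides with a biological descriptor, up-sampling the minority class, and scoring with balanced accuracy) might be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: amino acid composition is assumed here as an example descriptor because the abstract does not name the descriptors used, and the function names and the example peptide are hypothetical.

```python
import random
from collections import Counter

# The 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac(peptide):
    """Amino acid composition: fraction of each standard residue.

    Assumption: AAC stands in for the (unspecified) biological descriptors.
    """
    if not peptide:
        raise ValueError("peptide must be non-empty")
    counts = Counter(peptide.upper())
    return [counts[a] / len(peptide) for a in AMINO_ACIDS]


def random_upsample(X, y, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all
    classes have as many rows as the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(X, y):
        by_class.setdefault(label, []).append(row)
    target = max(len(rows) for rows in by_class.values())
    X_out, y_out = [], []
    for label, rows in by_class.items():
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for row in rows + extra:
            X_out.append(row)
            y_out.append(label)
    return X_out, y_out


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall; with two classes this equals
    (sensitivity + specificity) / 2."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


if __name__ == "__main__":
    # Hypothetical peptide, used only to show the shape of the feature vector.
    print(len(aac("ACDKLLV")))
    # Three positives and one negative; one positive is misclassified.
    print(balanced_accuracy([1, 1, 1, 0], [1, 1, 0, 0]))
```

Balanced accuracy averages recall over classes, so a classifier that always predicts the majority class scores 0.5 on a two-class problem regardless of how skewed the data are. In practice, up-sampling would be applied to the training split only, so that duplicated rows do not leak into the evaluation data.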

Keywords: Biological language models; Cancer vaccines; Machine learning; Peptide sequences; Protein encoding.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Antigens, Neoplasm* / chemistry
Antigens, Neoplasm* / genetics
Antigens, Neoplasm* / immunology
Computational Biology / methods
Humans
Natural Language Processing*
Neoplasms / immunology
Sequence Analysis, Protein / methods
Support Vector Machine*

Substances

Antigens, Neoplasm