On TCR binding predictors failing to generalize to unseen peptides

Filippo Grazioli; Anja Mösch; Pierre Machart; Kai Li; Israa Alqassem; Timothy J O'Donnell; Martin Renqiang Min

doi:10.3389/fimmu.2022.1014256

Front Immunol. 2022 Oct 21:13:1014256. doi: 10.3389/fimmu.2022.1014256. eCollection 2022.

Authors

Filippo Grazioli¹, Anja Mösch¹, Pierre Machart¹, Kai Li², Israa Alqassem¹, Timothy J O'Donnell³, Martin Renqiang Min²

Affiliations

¹ Biomedical AI Group, NEC Laboratories Europe, Heidelberg, Germany.
² Machine Learning Department, NEC Laboratories America, Princeton, NJ, United States.
³ Division of Hematology and Medical Oncology, Icahn School of Medicine at Mount Sinai, New York, NY, United States.

Abstract

Several recent studies investigate TCR-peptide/-pMHC binding prediction using machine learning or deep learning approaches. Many of these methods achieve impressive results on test sets that include peptide sequences also present in the training set. In this work, we investigate how state-of-the-art deep learning models for TCR-peptide/-pMHC binding prediction generalize to unseen peptides. We create a dataset including positive samples from IEDB, VDJdb, McPAS-TCR, and the MIRA set, as well as negative samples from both randomization and 10X Genomics assays. We name this collection of samples TChard. We propose the hard split, a simple heuristic for training/test splitting, which ensures that test samples exclusively present peptides that do not belong to the training set. We investigate the effect of different training/test splitting techniques on the models' test performance, as well as the effect of training and testing the models using mismatched negative samples generated randomly, in addition to the negative samples derived from assays. Our results show that modern deep learning methods fail to generalize to unseen peptides. We provide an explanation of why this happens and verify our hypothesis on the TChard dataset. We then conclude that robust prediction of TCR recognition is still far from being solved.
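
The two data-handling ideas in the abstract, a peptide-disjoint training/test split and randomly generated mismatched negatives, can be illustrated with a minimal Python sketch. This is not the authors' exact heuristic: the function names `hard_split` and `mismatched_negatives`, the `test_fraction` parameter, and the toy sequence pairs are all assumptions made for illustration, and the real TChard pipeline may select test peptides and sample negatives differently.

```python
import random
from typing import List, Set, Tuple

Pair = Tuple[str, str]  # (TCR CDR3 sequence, peptide sequence)


def hard_split(pairs: List[Pair], test_fraction: float = 0.2,
               seed: int = 0) -> Tuple[List[Pair], List[Pair]]:
    """Split pairs so that no test peptide also occurs in training.

    Hypothetical helper: peptides are held out at random, which is an
    assumption, not the selection rule used in the paper.
    """
    rng = random.Random(seed)
    peptides = sorted({pep for _, pep in pairs})
    rng.shuffle(peptides)
    n_test = max(1, round(len(peptides) * test_fraction))
    test_peptides: Set[str] = set(peptides[:n_test])
    train = [p for p in pairs if p[1] not in test_peptides]
    test = [p for p in pairs if p[1] in test_peptides]
    return train, test


def mismatched_negatives(positives: List[Pair], seed: int = 0) -> List[Pair]:
    """Pair each TCR with a random peptide it was not observed to bind.

    Hypothetical helper: unobserved pairs are assumed to be non-binding,
    which is exactly the kind of assumption the abstract scrutinizes.
    """
    rng = random.Random(seed)
    observed = set(positives)
    peptides = sorted({pep for _, pep in positives})
    negatives = []
    for tcr, _ in positives:
        candidates = [pep for pep in peptides if (tcr, pep) not in observed]
        if candidates:
            negatives.append((tcr, rng.choice(candidates)))
    return negatives


# Toy data for illustration only; these pairs are not taken from TChard.
toy_pairs: List[Pair] = [
    ("CASSIRSSYEQYF", "GILGFVFTL"),
    ("CASSLAPGATNEKLFF", "NLVPMVATV"),
    ("CSARDGTGNGYTF", "GLCTLVAML"),
    ("CASSPVTGGIYGYTF", "NLVPMVATV"),
    ("CASSIRSSYEQYF", "GLCTLVAML"),
]

train_pairs, test_pairs = hard_split(toy_pairs, test_fraction=0.34, seed=1)
toy_negatives = mismatched_negatives(toy_pairs, seed=1)
print(len(train_pairs), "training pairs;", len(test_pairs), "test pairs")
print(len(toy_negatives), "mismatched negative pairs")
```

Splitting by peptide rather than by sample is what makes the evaluation hard: a random split over samples would typically leave the same peptides on both sides, which is the setting where the abstract says existing methods look impressive.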

Keywords: MHC; TCR - T cell receptor; binding prediction; interaction prediction; machine learning; peptide; tcr.

MeSH terms

Peptides* / metabolism
Protein Binding
Receptors, Antigen, T-Cell* / metabolism

Substances

Receptors, Antigen, T-Cell
Peptides