Statistical Analysis and Tokenization of Epitopes to Construct Artificial Neoepitope Libraries

ACS Synth Biol. 2023 Oct 20;12(10):2812-2818. doi: 10.1021/acssynbio.3c00201. Epub 2023 Sep 13.

Abstract

Epitopes are specific regions on an antigen's surface that the immune system recognizes. They are usually protein regions on foreign immune-stimulating entities such as viruses and bacteria, although in some cases endogenous proteins may act as antigens. Identifying epitopes is crucial for accelerating the development of vaccines and immunotherapies. However, mapping epitopes in pathogen proteomes is challenging with conventional methods. Screening artificial neoepitope libraries against antibodies can overcome this issue. Here, we applied conventional sequence analysis and methods inspired by natural language processing to reveal specific sequence patterns in the linear epitopes deposited in the Immune Epitope Database (www.iedb.org) that can serve as building blocks for the design of universal epitope libraries. Our results reveal that amino acid frequency in annotated linear epitopes differs from that in the human proteome. Aromatic residues are overrepresented, while cysteines are practically absent from epitopes. Byte pair encoding tokenization shows high frequencies of tryptophan in tokens of 5, 6, and 7 amino acids, corroborating the findings of the conventional sequence analysis. These results can be applied to reduce the diversity of linear epitope libraries by orders of magnitude.
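The two analyses named in the abstract, residue frequency counting and byte pair encoding (BPE) tokenization, can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names (`amino_acid_frequencies`, `learn_bpe`, and the helpers) are hypothetical, the example peptides are toy strings rather than IEDB records, and the BPE shown is the textbook greedy variant that repeatedly merges the most frequent adjacent token pair.

```python
from collections import Counter


def amino_acid_frequencies(peptides):
    """Return the relative frequency of each residue across all peptides."""
    counts = Counter("".join(peptides))
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}


def most_frequent_pair(corpus):
    """Return the most common adjacent token pair, or None if there is none."""
    pairs = Counter()
    for tokens in corpus:
        pairs.update(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(tokens, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_bpe(peptides, num_merges):
    """Textbook greedy BPE: start from single residues, merge top pairs."""
    corpus = [list(p) for p in peptides]  # one token per amino acid
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(corpus)
        if pair is None:
            break
        corpus = [merge_pair(tokens, pair) for tokens in corpus]
        merges.append(pair)
    return merges, corpus


# Toy input (assumption): made-up peptides, not sequences from IEDB.
toy_peptides = ["WYWY", "WYAC"]
toy_freqs = amino_acid_frequencies(toy_peptides)
toy_merges, toy_corpus = learn_bpe(toy_peptides, num_merges=1)
```

In a sketch like this, comparing `amino_acid_frequencies` for an epitope set against the same function applied to a background proteome would expose over- or underrepresented residues, and counting the residues inside longer merged tokens would show which amino acids dominate recurring motifs.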

Keywords: byte pair encoding; epitope analysis; library design; natural language processing; tokenization.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Amino Acid Sequence
  • Amino Acids
  • Epitope Mapping / methods
  • Epitopes / genetics
  • Humans
  • Proteome
  • Viruses*

Substances

  • Epitopes
  • Proteome
  • Amino Acids