Integrating Embeddings from Multiple Protein Language Models to Improve Protein O-GlcNAc Site Prediction

Int J Mol Sci. 2023 Nov 6;24(21):16000. doi: 10.3390/ijms242116000.

Abstract

O-linked β-N-acetylglucosamine (O-GlcNAc) is a distinct monosaccharide modification of serine (S) or threonine (T) residues of nucleocytoplasmic and mitochondrial proteins. O-GlcNAc modification (i.e., O-GlcNAcylation) is involved in the regulation of diverse cellular processes, including transcription, epigenetic modifications, and cell signaling. Despite great progress in experimentally mapping O-GlcNAc sites, there is an unmet need for robust prediction tools that can effectively locate O-GlcNAc sites in protein sequences of interest. In this work, we performed a comprehensive evaluation of a framework for predicting protein O-GlcNAc sites using embeddings from pre-trained protein language models. In particular, we compared the performance of three large sequence-based protein language models (pLMs), Ankh, ESM-2, and ProtT5, for prediction of O-GlcNAc sites, and also evaluated various ensemble strategies for integrating embeddings from these models. The decision-level fusion approach that integrates the decisions of the three embedding models, which we call LM-OGlcNAc-Site, outperformed the models trained on the individual language models, the other fusion approaches, and other existing predictors on almost all of the parameters evaluated. Precise prediction of O-GlcNAc sites will facilitate the probing of O-GlcNAc site-specific functions of proteins in physiology and disease. Moreover, these findings indicate the effectiveness of combining multiple protein language models for post-translational modification prediction and open avenues for further research in other downstream protein tasks. LM-OGlcNAc-Site's web server and source code are publicly available to the community.
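
As an illustration of what decision-level fusion can look like, the sketch below combines per-site probabilities from three separately trained classifiers, one per pLM embedding. The abstract does not state the fusion rule used by LM-OGlcNAc-Site, so the soft-voting (mean probability) rule, the 0.5 threshold, the function name `fuse_decisions`, and the probability values are all assumptions made for illustration. They are not the authors' implementation or results.

```python
from typing import Dict, List


def fuse_decisions(
    probs_by_model: Dict[str, List[float]],
    threshold: float = 0.5,
) -> List[int]:
    """Soft-voting fusion (an assumed rule, not confirmed by the abstract).

    probs_by_model maps a model name to its predicted probability of
    O-GlcNAcylation for each candidate S/T site, in the same site order.
    Returns 1 (predicted site) or 0 (predicted non-site) per candidate.
    """
    if not probs_by_model:
        raise ValueError("at least one model is required")

    models = list(probs_by_model)
    n_sites = len(probs_by_model[models[0]])
    if any(len(probs_by_model[m]) != n_sites for m in models):
        raise ValueError("all models must score the same number of sites")

    fused = []
    for i in range(n_sites):
        # Average the per-model probabilities for candidate site i.
        mean_p = sum(probs_by_model[m][i] for m in models) / len(models)
        fused.append(1 if mean_p >= threshold else 0)
    return fused


# Hypothetical placeholder probabilities for three candidate sites.
example = {
    "Ankh": [0.9, 0.2, 0.6],
    "ESM-2": [0.8, 0.4, 0.3],
    "ProtT5": [0.7, 0.1, 0.5],
}
labels = fuse_decisions(example)
```

In this sketch each base classifier is trained independently and only its output is combined. An embedding-level (feature-level) fusion would instead concatenate the three embeddings before training a single classifier.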

Keywords: O-GlcNAc prediction; embeddings; ensemble learning; post-translational modification prediction; protein language models.

MeSH terms

  • Acetylglucosamine / metabolism
  • Amino Acid Sequence
  • N-Acetylglucosaminyltransferases / metabolism
  • Protein Processing, Post-Translational*
  • Proteins* / chemistry

Substances

  • Proteins
  • Acetylglucosamine
  • N-Acetylglucosaminyltransferases

Grants and funding

This research was funded by the National Science Foundation (NSF), grant numbers 1901793 and 2210356 (granted to D.B.K.). Part of this work was supported by the MDHHS Michigan Sequencing and Academic Partnerships for Public Health Innovation and Response (MI-SAPPHIRE) grant.