Protein language models meet reduced amino acid alphabets

Bioinformatics. 2024 Feb 1;40(2):btae061. doi: 10.1093/bioinformatics/btae061.

Abstract

Motivation: Protein language models (PLMs), which borrow modelling and inference ideas from natural language processing, have demonstrated the ability to extract meaningful representations in an unsupervised way, leading to significant performance improvements on several downstream tasks. Clustering amino acids by their physicochemical properties to obtain reduced alphabets has been of interest in past research, but the application of reduced alphabets to PLMs or folding models remains unexplored.

Results: Here, we investigate how well PLMs trained on reduced amino acid alphabets capture evolutionary information, and we explore how the resulting loss of protein sequence information affects learned representations and downstream task performance. Our empirical work shows that PLMs trained on the full alphabet and a large number of sequences capture fine details that are lost under alphabet reduction. We further show that a structure prediction model (ESMFold) can fold CASP14 protein sequences translated into a reduced alphabet. For 10 of the 50 targets, reduced alphabets improve structural predictions, with LDDT-Cα differences of up to 19%.
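To make the "translation" step concrete, the sketch below shows how a protein sequence can be mapped into a reduced alphabet before being passed to a PLM or a folding model. The five-group clustering used here is a generic physicochemical grouping chosen for illustration only; it is an assumption, not the specific reduction scheme evaluated in the paper.

```python
# Minimal sketch: translating a protein sequence into a reduced alphabet.
# The grouping below (hydrophobic / aromatic / positive / negative-amide / small-polar)
# is a hypothetical illustration, not necessarily the clustering used in the paper.

REDUCED_GROUPS = {
    "A": "AVLIMC",  # small / hydrophobic, represented by A
    "F": "FWYH",    # aromatic
    "K": "KR",      # positively charged
    "D": "DENQ",    # negatively charged / amide
    "S": "STPG",    # small / polar
}

# Invert to a per-residue lookup table: amino acid -> group symbol.
AA_TO_GROUP = {aa: group for group, aas in REDUCED_GROUPS.items() for aa in aas}


def reduce_sequence(seq: str) -> str:
    """Map every residue to its group symbol; unknown residues map to 'X'."""
    return "".join(AA_TO_GROUP.get(aa, "X") for aa in seq.upper())


if __name__ == "__main__":
    original = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
    print(reduce_sequence(original))
    # The reduced-alphabet string could then be supplied to a folding model
    # such as ESMFold in place of the original sequence, as described above.
```

In this setup, information lost by the reduction (e.g. distinguishing leucine from isoleucine) is exactly what the downstream model can no longer recover, which is the trade-off the study quantifies.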

Availability and implementation: Trained models and code are available at github.com/Ieremie/reduced-alph-PLM.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Amines
  • Amino Acid Sequence
  • Amino Acids / chemistry
  • Protein Folding*
  • Proteins* / chemistry

Substances

  • Proteins
  • Amino Acids
  • Amines