Mapping the glycosyltransferase fold landscape using interpretable deep learning

Nat Commun. 2021 Sep 27;12(1):5656. doi: 10.1038/s41467-021-25975-9.

Abstract

Glycosyltransferases (GTs) play fundamental roles in nearly all cellular processes through the biosynthesis of complex carbohydrates and the glycosylation of diverse protein and small-molecule substrates. The extensive structural and functional diversification of GTs presents a major challenge in mapping the relationships connecting sequence, structure, fold and function using traditional bioinformatics approaches. Here, we present a deep learning model based on a convolutional neural network with attention (CNN-attention) that uses simple secondary structure representations generated from primary sequences to predict GT folds with high accuracy. The model learns distinguishing secondary structure features free of primary sequence alignment constraints and is highly interpretable. It delineates sequence and structural features characteristic of individual fold types and classifies the folds into distinct clusters that group evolutionarily divergent families based on shared secondary structural features. We further extend our model to classify GT families of unknown folds and variants of known folds. By identifying families that are likely to adopt novel folds, such as GT91, GT96 and GT97, our studies expand the GT fold landscape and prioritize targets for future structural studies.
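
As a rough illustration of the general idea described in the abstract (a convolution over a secondary structure string followed by attention pooling and a fold classifier), the following minimal Python sketch may help. It is not the authors' architecture: the three-state H/E/C alphabet, the kernel width, the number of filters, the random untrained weights, the function names, and the use of GT-A, GT-B and GT-C purely as placeholder class labels are all assumptions made for this example.

```python
import math
import random

SS_STATES = "HEC"  # assumed 3-state alphabet: helix, strand, coil
FOLD_LABELS = ["GT-A", "GT-B", "GT-C"]  # placeholder class labels for illustration


def one_hot(ss):
    """Encode a secondary structure string as one-hot vectors over SS_STATES."""
    return [[1.0 if s == state else 0.0 for state in SS_STATES] for s in ss]


def conv1d(x, kernels, width):
    """Valid 1D convolution with ReLU; kernels[k][j][c] weights offset j, channel c."""
    out = []
    for i in range(len(x) - width + 1):
        window = x[i:i + width]
        feats = []
        for kern in kernels:
            total = sum(kern[j][c] * window[j][c]
                        for j in range(width) for c in range(len(SS_STATES)))
            feats.append(max(0.0, total))
        out.append(feats)
    return out


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def attention_pool(features, query):
    """Score each position against a query vector, then take the weighted sum."""
    scores = [sum(f * q for f, q in zip(feat, query)) for feat in features]
    weights = softmax(scores)
    pooled = [sum(w * feat[k] for w, feat in zip(weights, features))
              for k in range(len(query))]
    return pooled, weights


def random_params(n_kernels=4, width=5, n_classes=len(FOLD_LABELS), seed=0):
    """Untrained random weights; a real model would learn these from data."""
    rng = random.Random(seed)
    return {
        "width": width,
        "kernels": [[[rng.uniform(-1, 1) for _ in SS_STATES]
                     for _ in range(width)] for _ in range(n_kernels)],
        "query": [rng.uniform(-1, 1) for _ in range(n_kernels)],
        "out_w": [[rng.uniform(-1, 1) for _ in range(n_kernels)]
                  for _ in range(n_classes)],
        "out_b": [0.0] * n_classes,
    }


def predict_fold(ss, params):
    """Return (class probabilities, per-position attention weights)."""
    if len(ss) < params["width"]:
        raise ValueError("sequence shorter than the convolution width")
    feats = conv1d(one_hot(ss), params["kernels"], params["width"])
    pooled, weights = attention_pool(feats, params["query"])
    logits = [sum(w * p for w, p in zip(row, pooled)) + b
              for row, b in zip(params["out_w"], params["out_b"])]
    return softmax(logits), weights


if __name__ == "__main__":
    ss = "CCHHHHHCCEEEEECCHHHHCC"  # made-up secondary structure string
    probs, attn = predict_fold(ss, random_params())
    for label, p in zip(FOLD_LABELS, probs):
        print(label, round(p, 3))
    # The attention weights indicate which windows contributed most to the
    # pooled representation, which is the kind of signal that makes such a
    # model inspectable.
    print("most attended window starts at position", attn.index(max(attn)))
```

Because the weights here are random, the printed probabilities carry no biological meaning. The sketch only shows how attention weights over convolutional windows can be read back onto positions in the input, which is one plausible route to the interpretability the abstract mentions.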

Publication types

  • Research Support, N.I.H., Extramural

MeSH terms

  • Amino Acid Sequence / genetics
  • Computational Biology / methods
  • Databases, Genetic
  • Datasets as Topic
  • Deep Learning*
  • Glycosylation
  • Glycosyltransferases / genetics
  • Glycosyltransferases / metabolism*
  • Protein Folding*
  • Protein Structure, Secondary / genetics
  • Protein Structure, Tertiary / genetics
  • Sequence Alignment

Substances

  • Glycosyltransferases