Two sequence- and two structure-based ML models have learned different aspects of protein biochemistry

Anastasiya V Kulikova; Daniel J Diaz; Tianlong Chen; T Jeffrey Cole; Andrew D Ellington; Claus O Wilke

doi:10.1038/s41598-023-40247-w

Sci Rep. 2023 Aug 16;13(1):13280. doi: 10.1038/s41598-023-40247-w.

Authors

Anastasiya V Kulikova¹ ², Daniel J Diaz³ ² ⁴, Tianlong Chen⁴ ⁵, T Jeffrey Cole¹, Andrew D Ellington², Claus O Wilke⁶

Affiliations

¹ Department of Integrative Biology, University of Texas at Austin, Austin, TX, USA.
² The Department of Molecular Biosciences, Center for Systems and Synthetic Biology, The University of Texas at Austin, Austin, TX, USA.
³ Department of Chemistry, The University of Texas at Austin, Austin, TX, USA.
⁴ Institute for Foundations of Machine Learning (IFML), The University of Texas at Austin, Austin, TX, USA.
⁵ Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX, USA.
⁶ Department of Integrative Biology, University of Texas at Austin, Austin, TX, USA. wilke@austin.utexas.edu.

Abstract

Deep learning models are seeing increased use as methods to predict mutational effects or allowed mutations in proteins. The models commonly used for these purposes include large language models (LLMs) and 3D Convolutional Neural Networks (CNNs). These two model types have very different architectures and are commonly trained on different representations of proteins. LLMs make use of the transformer architecture and are trained purely on protein sequences whereas 3D CNNs are trained on voxelized representations of local protein structure. While comparable overall prediction accuracies have been reported for both types of models, it is not known to what extent these models make comparable specific predictions and/or generalize protein biochemistry in similar ways. Here, we perform a systematic comparison of two LLMs and two structure-based models (CNNs) and show that the different model types have distinct strengths and weaknesses. The overall prediction accuracies are largely uncorrelated between the sequence- and structure-based models. Overall, the two structure-based models are better at predicting buried aliphatic and hydrophobic residues whereas the two LLMs are better at predicting solvent-exposed polar and charged amino acids. Finally, we find that a combined model that takes the individual model predictions as input can leverage these individual model strengths and results in significantly improved overall prediction accuracy.
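
The idea of a combined model that takes individual model predictions as input can be illustrated with a minimal stacking sketch. Everything below is an assumption made for illustration and is not the authors' architecture or data: the two mock predictors (`mock_model`), their synthetic "strong" residue classes, and the use of plain softmax regression as the combiner are all hypothetical stand-ins. The sketch also uses two inputs rather than the four models compared in the paper.

```python
import numpy as np

N_AA = 20  # number of amino-acid classes


def mock_model(y, strong_classes, rng):
    """Hypothetical stand-in for one model's per-site probabilities.

    The output is peaked on the true residue for `strong_classes`
    and is diffuse noise for all other residues.
    """
    noise = rng.dirichlet(np.ones(N_AA), size=len(y))
    probs = noise.copy()
    strong = np.isin(y, strong_classes)
    onehot = np.eye(N_AA)[y]
    probs[strong] = 0.9 * onehot[strong] + 0.1 * noise[strong]
    return probs


def stack_predictions(model_probs):
    """Concatenate per-model (n_sites, 20) arrays into one feature matrix."""
    return np.concatenate(model_probs, axis=1)


def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def fit_combiner(X, y, lr=0.5, epochs=300):
    """Fit softmax regression by batch gradient descent (assumed combiner)."""
    n, d = X.shape
    W = np.zeros((d, N_AA))
    b = np.zeros(N_AA)
    Y = np.eye(N_AA)[y]
    for _ in range(epochs):
        P = softmax(X @ W + b)
        G = (P - Y) / n  # gradient of mean cross-entropy w.r.t. logits
        W -= lr * (X.T @ G)
        b -= lr * G.sum(axis=0)
    return W, b


def predict(X, W, b):
    return softmax(X @ W + b).argmax(axis=1)


# Synthetic data: each mock model is reliable on a different half of the classes.
rng = np.random.default_rng(0)
y = rng.integers(0, N_AA, size=2000)
model_a = mock_model(y, np.arange(0, 10), rng)
model_b = mock_model(y, np.arange(10, 20), rng)

X = stack_predictions([model_a, model_b])
train, test = slice(0, 1000), slice(1000, 2000)
W, b = fit_combiner(X[train], y[train])

acc_a = (model_a[test].argmax(axis=1) == y[test]).mean()
acc_b = (model_b[test].argmax(axis=1) == y[test]).mean()
acc_combined = (predict(X[test], W, b) == y[test]).mean()
print(f"model A: {acc_a:.3f}  model B: {acc_b:.3f}  combined: {acc_combined:.3f}")
```

Because the two mock predictors are constructed to have complementary strengths, the combiner can learn which input to trust for which residue class. Real model outputs will not separate this cleanly, so any gain on actual proteins would have to be measured on held-out data.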

Publication types

Research Support, U.S. Gov't, Non-P.H.S.
Research Support, Non-U.S. Gov't
Research Support, N.I.H., Extramural

MeSH terms

Amino Acid Sequence
Amino Acids*
Antifibrinolytic Agents*
Electric Power Supplies
Language

Substances

Amino Acids
Antifibrinolytic Agents

Grants and funding

R01 AI148419/AI/NIAID NIH HHS/United States