Protein embeddings and deep learning predict binding residues for various ligand classes

Maria Littmann; Michael Heinzinger; Christian Dallago; Konstantin Weissenow; Burkhard Rost

doi:10.1038/s41598-021-03431-4

Sci Rep. 2021 Dec 13;11(1):23916. doi: 10.1038/s41598-021-03431-4.

Authors

Maria Littmann¹, Michael Heinzinger²,³, Christian Dallago²,³, Konstantin Weissenow²,³, Burkhard Rost²,⁴,⁵,⁶

Affiliations

¹ Department of Informatics, Bioinformatics and Computational Biology, I12, TUM (Technical University of Munich), Boltzmannstr. 3, 85748, Garching/Munich, Germany. littmann@rostlab.org.
² Department of Informatics, Bioinformatics and Computational Biology, I12, TUM (Technical University of Munich), Boltzmannstr. 3, 85748, Garching/Munich, Germany.
³ TUM Graduate School, Center of Doctoral Studies in Informatics and Its Applications (CeDoSIA), Boltzmannstr. 11, 85748, Garching, Germany.
⁴ Institute for Advanced Study (TUM-IAS), Lichtenbergstr. 2a, Garching, 85748, Munich, Germany.
⁵ TUM School of Life Sciences Weihenstephan (TUM-WZW), Alte Akademie 8, Freising, Germany.
⁶ Department of Biochemistry and Molecular Biophysics, Columbia University, 701 West, 168th Street, New York, NY, 10032, USA.

Abstract

One important aspect of protein function is the binding of proteins to ligands, including small molecules, metal ions, and macromolecules such as DNA or RNA. Despite decades of experimental progress, many binding sites remain obscure. Here, we proposed bindEmbed21, a method predicting whether a protein residue binds to metal ions, nucleic acids, or small molecules. The Artificial Intelligence (AI)-based method exclusively uses embeddings from the Transformer-based protein Language Model (pLM) ProtT5 as input. Using only single sequences without creating multiple sequence alignments (MSAs), bindEmbed21DL outperformed MSA-based predictions. Combination with homology-based inference increased performance to F1 = 48 ± 3% (95% CI) and MCC = 0.46 ± 0.04 when merging all three ligand classes into one. All results were confirmed by three independent data sets. Focusing on very reliably predicted residues could complement experimental evidence: for the 25% most strongly predicted binding residues, at least 73% were correctly predicted even when ignoring the problem of missing experimental annotations. The new method bindEmbed21 is fast, simple, and broadly applicable, using neither structure nor MSAs. Thereby, it found binding residues in over 42% of all human proteins not otherwise implied in binding and predicted about 6% of all residues as binding to metal ions, nucleic acids, or small molecules.
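
The abstract reports performance as F1 and MCC over residues labelled binding or non-binding. As a reading aid, the minimal sketch below shows how these two scores follow from a confusion matrix. It is not the authors' evaluation code: the function name `binding_metrics`, the pooling of all residues into one count, and the toy labels are assumptions made for illustration, and the paper may aggregate differently (for example per protein).

```python
import math


def binding_metrics(observed, predicted):
    """Precision, recall, F1 and MCC for binary per-residue labels.

    observed / predicted: equal-length sequences of 0 (non-binding)
    and 1 (binding). Hypothetical helper, not part of bindEmbed21.
    """
    if len(observed) != len(predicted):
        raise ValueError("label sequences must have equal length")

    pairs = list(zip(observed, predicted))
    tp = sum(1 for o, p in pairs if o and p)          # binding, predicted binding
    tn = sum(1 for o, p in pairs if not o and not p)  # non-binding, predicted non-binding
    fp = sum(1 for o, p in pairs if not o and p)      # non-binding, predicted binding
    fn = sum(1 for o, p in pairs if o and not p)      # binding, predicted non-binding

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)

    # MCC is defined as 0 here when any marginal count is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {"precision": precision, "recall": recall, "f1": f1, "mcc": mcc}


# Toy 10-residue protein with made-up labels (not data from the paper).
observed = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
metrics = binding_metrics(observed, predicted)
print(metrics)
```

Because binding residues are a small minority of all residues, MCC is a useful companion to F1: it also accounts for correctly predicted non-binding residues, which F1 ignores.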

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Binding Sites
Deep Learning*
Ligands
Metals / chemistry
Molecular Docking Simulation / methods*
Nucleic Acids / chemistry
Protein Binding
Protein Conformation
Sequence Analysis, Protein / methods*
Software

Substances

Ligands
Metals
Nucleic Acids