Protein language-model embeddings for fast, accurate, and alignment-free protein structure prediction

Structure. 2022 Aug 4;30(8):1169-1177.e4. doi: 10.1016/j.str.2022.05.001. Epub 2022 May 23.

Abstract

Advanced protein structure prediction requires evolutionary information in the form of multiple sequence alignments (MSAs) and the evolutionary couplings derived from them, which are not always available. Artificial intelligence (AI)-based predictions that input only single sequences are faster but so inaccurate as to render the speed irrelevant. Here, we describe competitive prediction of inter-residue distances (2D structure) exclusively inputting embeddings from pre-trained protein language models (pLMs), namely ProtT5, from single sequences into a convolutional neural network (CNN) with relatively few layers. The major advance came from using the ProtT5 attention heads. Our new method, EMBER2, which never requires any MSAs, performed similarly to other methods that fully rely on co-evolution. Although clearly not reaching AlphaFold2, our leaner solution came somewhat close at substantially lower computational cost. By generating protein-specific rather than family-averaged predictions, EMBER2 might better capture some features of particular protein structures. Results from protein engineering and deep mutational scanning (DMS) experiments provided at least a proof of principle for this speculation.
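The abstract describes a concrete pipeline: per-residue ProtT5 embeddings and attention maps from a single sequence are turned into pairwise features and fed to a shallow 2D CNN that predicts inter-residue distances. The Python sketch below illustrates that idea under stated assumptions; it is not the authors' EMBER2 implementation. The checkpoint name Rostlab/prot_t5_xl_uniref50 is the public ProtT5 model on Hugging Face, while DistanceCNN, its layer sizes, and the number of distance bins are illustrative choices.

```python
# Minimal sketch (not the authors' EMBER2 code): extract ProtT5 embeddings and
# attention maps for one sequence, build pairwise features, and run a shallow
# 2D CNN over them to produce inter-residue distance-bin logits.
import torch
import torch.nn as nn
from transformers import T5EncoderModel, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained(
    "Rostlab/prot_t5_xl_uniref50", do_lower_case=False
)
plm = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_uniref50")
plm.eval()

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # example sequence
# ProtT5 expects space-separated residues (rare residues are usually mapped to X).
inputs = tokenizer(" ".join(sequence), return_tensors="pt")

with torch.no_grad():
    out = plm(**inputs, output_attentions=True)

L = len(sequence)
emb = out.last_hidden_state[0, :L]  # (L, 1024) per-residue embeddings; drops </s>
# Stack attention maps from all layers and heads: (layers * heads, L, L).
attn = torch.cat([a[0, :, :L, :L] for a in out.attentions], dim=0)

# Pairwise features: embeddings of residues i and j, plus the attention maps.
pair = torch.cat(
    [
        emb.unsqueeze(1).expand(L, L, -1),  # embedding of residue i
        emb.unsqueeze(0).expand(L, L, -1),  # embedding of residue j
    ],
    dim=-1,
).permute(2, 0, 1)  # (2048, L, L)
features = torch.cat([pair, attn], dim=0).unsqueeze(0)  # (1, C, L, L)

class DistanceCNN(nn.Module):
    """Shallow 2D CNN mapping pairwise features to distance-bin logits.

    Depth, widths, and n_bins are illustrative assumptions, not EMBER2's.
    """

    def __init__(self, in_channels: int, n_bins: int = 42):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 128, kernel_size=1),  # channel reduction
            nn.ReLU(),
            nn.Conv2d(128, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(128, n_bins, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return self.net(x)  # (1, n_bins, L, L)

model = DistanceCNN(in_channels=features.shape[1])
with torch.no_grad():
    logits = model(features)
print(logits.shape)  # (1, n_bins, L, L)
```

Because everything is computed from one sequence in a single forward pass, with no MSA search, this kind of pipeline runs in seconds per protein; the untrained CNN here only fixes the shapes and data flow, and would need to be trained on distance maps from known structures.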

Keywords: deep learning; machine learning; multiple sequence alignments; protein language model; protein structure prediction.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Artificial Intelligence
  • Computational Biology* / methods
  • Language*
  • Proteins / chemistry
  • Sequence Alignment

Substances

  • Proteins