The promises of large language models for protein design and modeling

Giorgio Valentini; Dario Malchiodi; Jessica Gliozzo; Marco Mesiti; Mauricio Soto-Gomez; Alberto Cabri; Justin Reese; Elena Casiraghi; Peter N Robinson

doi:10.3389/fbinf.2023.1304099

Front Bioinform. 2023 Nov 23;3:1304099. doi: 10.3389/fbinf.2023.1304099. eCollection 2023.

Authors

Giorgio Valentini¹,²; Dario Malchiodi¹; Jessica Gliozzo¹,³; Marco Mesiti¹; Mauricio Soto-Gomez¹; Alberto Cabri¹; Justin Reese⁴; Elena Casiraghi¹,²,⁴; Peter N Robinson⁵

Affiliations

¹ AnacletoLab, Dipartimento di Informatica, Università degli Studi di Milano, Milan, Italy.
² ELLIS, European Laboratory for Learning and Intelligent Systems, Milan, Italy.
³ European Commission, Joint Research Centre (JRC), Ispra, Italy.
⁴ Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA, United States.
⁵ Jackson Lab for Genomic Medicine, Farmington, CT, United States.

Abstract

The recent breakthroughs of Large Language Models (LLMs) in natural language processing have opened the way to significant advances in protein research. Indeed, the relationships between human natural language and the "language of proteins" invite the application and adaptation of LLMs to protein modeling and design. Considering the impressive results of GPT-4 and other recently developed LLMs in processing, generating and translating human languages, we anticipate analogous results with the language of proteins. Protein language models have already been trained to accurately predict protein properties and to generate novel, functionally characterized proteins, achieving state-of-the-art results. In this paper, we discuss the promises and the open challenges raised by this novel and exciting research area, and we propose our perspective on how LLMs will affect protein modeling and design.

Keywords: deep learning; large language models; protein design; protein engineering; protein modeling; transformers.

Grants and funding

The authors declare that financial support was received for the research, authorship, and/or publication of this article. This research was supported by the “National Center for Gene Therapy and Drugs based on RNA Technology,” PNRR-NextGenerationEU program [G43C22001320007], and by the Director, Office of Science, Office of Basic Energy Sciences of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231, and was realised with the collaboration of the European Commission Joint Research Centre under the Collaborative Doctoral Partnership Agreement No. 35454.