OpenFold: retraining AlphaFold2 yields new insights into its learning mechanisms and capacity for generalization

Gustaf Ahdritz; Nazim Bouatta; Christina Floristean; Sachin Kadyan; Qinghui Xia; William Gerecke; Timothy J O'Donnell; Daniel Berenberg; Ian Fisk; Niccolò Zanichelli; Bo Zhang; Arkadiusz Nowaczynski; Bei Wang; Marta M Stepniewska-Dziubinska; Shang Zhang; Adegoke Ojewole; Murat Efe Guney; Stella Biderman; Andrew M Watkins; Stephen Ra; Pablo Ribalta Lorenzo; Lucas Nivon; Brian Weitzner; Yih-En Andrew Ban; Shiyang Chen; Minjia Zhang; Conglong Li; Shuaiwen Leon Song; Yuxiong He; Peter K Sorger; Emad Mostaque; Zhao Zhang; Richard Bonneau; Mohammed AlQuraishi

doi:10.1038/s41592-024-02272-z

Nat Methods. 2024 May 14. doi: 10.1038/s41592-024-02272-z. Online ahead of print.

Authors

Gustaf Ahdritz^#¹,², Nazim Bouatta^#³, Christina Floristean¹, Sachin Kadyan¹, Qinghui Xia¹, William Gerecke⁴, Timothy J O'Donnell⁵, Daniel Berenberg⁶, Ian Fisk⁷, Niccolò Zanichelli⁸, Bo Zhang⁹, Arkadiusz Nowaczynski¹⁰, Bei Wang¹⁰, Marta M Stepniewska-Dziubinska¹⁰, Shang Zhang¹⁰, Adegoke Ojewole¹⁰, Murat Efe Guney¹⁰, Stella Biderman¹¹,¹², Andrew M Watkins¹³, Stephen Ra¹³, Pablo Ribalta Lorenzo¹⁰, Lucas Nivon¹⁴, Brian Weitzner¹⁵, Yih-En Andrew Ban¹⁶, Shiyang Chen¹⁷, Minjia Zhang¹⁸, Conglong Li¹⁹, Shuaiwen Leon Song¹⁹, Yuxiong He¹⁹, Peter K Sorger⁴, Emad Mostaque²⁰, Zhao Zhang¹⁷, Richard Bonneau¹³, Mohammed AlQuraishi²¹

Affiliations

¹ Department of Systems Biology, Columbia University, New York, NY, USA.
² Harvard University, Cambridge, MA, USA.
³ Laboratory of Systems Pharmacology, Harvard Medical School, Boston, MA, USA. nbouatta@gmail.com.
⁴ Laboratory of Systems Pharmacology, Harvard Medical School, Boston, MA, USA.
⁵ Icahn School of Medicine at Mount Sinai, New York, NY, USA.
⁶ Department of Computer Science, Courant Institute of Mathematical Sciences, New York University, New York, NY, USA.
⁷ Flatiron Institute, New York, NY, USA.
⁸ OpenBioML, Cambridge, MA, USA.
⁹ Scientific Computing and Imaging Institute, University of Utah, Salt Lake City, UT, USA.
¹⁰ NVIDIA, Santa Clara, CA, USA.
¹¹ EleutherAI, New York, NY, USA.
¹² Booz Allen Hamilton, McLean, VA, USA.
¹³ Prescient Design, Genentech, New York, NY, USA.
¹⁴ Cyrus Bio, Seattle, WA, USA.
¹⁵ Outpace Bio, Seattle, WA, USA.
¹⁶ Arzeda, Seattle, WA, USA.
¹⁷ Rutgers University, New Brunswick, NJ, USA.
¹⁸ University of Illinois at Urbana-Champaign, Champaign, IL, USA.
¹⁹ Microsoft, Redmond, WA, USA.
²⁰ Stability AI, Los Altos, CA, USA.
²¹ Department of Systems Biology, Columbia University, New York, NY, USA. m.alquraishi@columbia.edu.

^# Contributed equally.

PMID: 38744917
DOI: 10.1038/s41592-024-02272-z

Abstract

AlphaFold2 revolutionized structural biology with the ability to predict protein structures with exceptionally high accuracy. Its implementation, however, lacks the code and data required to train new models. These are necessary to (1) tackle new tasks, like protein-ligand complex structure prediction, (2) investigate the process by which the model learns and (3) assess the model's capacity to generalize to unseen regions of fold space. Here we report OpenFold, a fast, memory-efficient and trainable implementation of AlphaFold2. We train OpenFold from scratch, matching the accuracy of AlphaFold2. Having established parity, we find that OpenFold is remarkably robust at generalizing even when the size and diversity of its training set are deliberately limited, including near-complete elisions of classes of secondary structure elements. By analyzing intermediate structures produced during training, we also gain insights into the hierarchical manner in which OpenFold learns to fold. In sum, our studies demonstrate the power and utility of OpenFold, which we believe will prove to be a crucial resource for the protein modeling community.

Grants and funding