Computational design of novel Cas9 PAM-interacting domains using evolution-based modelling and structural quality assessment

Cyril Malbranke; William Rostain; Florence Depardieu; Simona Cocco; Rémi Monasson; David Bikard

doi:10.1371/journal.pcbi.1011621

PLoS Comput Biol. 2023 Nov 17;19(11):e1011621. doi: 10.1371/journal.pcbi.1011621. eCollection 2023 Nov.

Authors

Cyril Malbranke¹ ², William Rostain², Florence Depardieu², Simona Cocco¹, Rémi Monasson¹, David Bikard²

Affiliations

¹ Laboratory of Physics of the Ecole Normale Superieure, PSL Research, CNRS UMR 8023, Sorbonne Université, Paris, France.
² Institut Pasteur, Université Paris Cité, CNRS UMR 6047, Synthetic Biology, Paris, France.

Abstract

We present here an approach to protein design that combines (i) scarce functional information such as experimental data, (ii) evolutionary information learned from natural sequence variants, and (iii) physics-grounded modeling. Using a Restricted Boltzmann Machine (RBM), we learn a sequence model of a protein family. We use semi-supervision to leverage available functional information during the RBM training. We then propose a strategy to explore the protein representation space that can be informed by external models such as an empirical force-field method (FoldX). Our approach is applied to a domain of the Cas9 protein responsible for recognition of a short DNA motif. We experimentally assess the functionality of 71 variants generated to explore a range of RBM and FoldX energies. Sequences with as many as 50 differences (20% of the protein domain) to the wild-type retained functionality. Overall, 21/71 sequences designed with our method were functional. Interestingly, 6/71 sequences showed improved activity in comparison with the original wild-type protein sequence. These results demonstrate the value of further exploring the synergies between machine learning of protein sequence representations and physics-grounded modeling strategies informed by structural information.
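To make the notion of an "RBM energy" for a sequence concrete, the following minimal sketch scores an amino-acid sequence with the free energy of an RBM that has categorical (Potts) visible units and binary hidden units. It is an illustration under stated assumptions, not the authors' implementation: the binary hidden units, the alphabet, the randomly initialised parameters, and the function names `free_energy`, `hidden_inputs`, and `random_rbm` are all hypothetical, and neither training, semi-supervision, nor the FoldX coupling described in the abstract is shown.

```python
import math
import random

# Assumed alphabet: 20 amino acids plus a gap symbol for aligned sequences.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def hidden_inputs(seq, weights):
    """Input received by each hidden unit: I_mu = sum_i w[mu][i][v_i]."""
    idx = [ALPHABET.index(a) for a in seq]
    return [sum(w_mu[i][a] for i, a in enumerate(idx)) for w_mu in weights]


def free_energy(seq, fields, weights, hidden_bias):
    """Free energy F(v) = -sum_i g_i(v_i) - sum_mu log(1 + exp(c_mu + I_mu)).

    Assumes binary hidden units, which are summed out analytically.
    Lower F corresponds to higher probability under the model.
    """
    idx = [ALPHABET.index(a) for a in seq]
    field_term = sum(fields[i][a] for i, a in enumerate(idx))
    hidden_term = sum(
        softplus(c + inp)
        for c, inp in zip(hidden_bias, hidden_inputs(seq, weights))
    )
    return -field_term - hidden_term


def random_rbm(length, n_hidden, seed=0):
    """Hypothetical untrained parameters, used only to exercise the code."""
    rng = random.Random(seed)
    q = len(ALPHABET)
    fields = [[rng.gauss(0, 0.1) for _ in range(q)] for _ in range(length)]
    weights = [
        [[rng.gauss(0, 0.1) for _ in range(q)] for _ in range(length)]
        for _ in range(n_hidden)
    ]
    hidden_bias = [0.0] * n_hidden
    return fields, weights, hidden_bias


if __name__ == "__main__":
    fields, weights, hidden_bias = random_rbm(length=5, n_hidden=3)
    print(free_energy("ACDEF", fields, weights, hidden_bias))
```

In a design setting, a score of this kind could be compared across candidate sequences, with lower values indicating sequences the model considers more typical of the family; how such a score is combined with a structure-based energy is a separate design choice not covered by this sketch.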

Copyright: © 2023 Malbranke et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

MeSH terms

Amino Acid Sequence
CRISPR-Cas Systems*
Learning
Machine Learning
Proteins* / chemistry
Proteins* / genetics

Substances

Proteins

Grants and funding

SC and RM were supported by the Agence Nationale de la Recherche grant numbers ANR-17-CE30-0021 RBMPro and ANR-19-CE30-0021 Decrypted. CM is the recipient of PhD funding from the AMX program, École polytechnique, and benefits from financial support from the Centre de Recherche Interdisciplinary (CRI) through "École Doctorale Frontières de l'Innovation en Recherche et Education – Programme Bettencourt". DB, WR and FD were supported by European Research Council [677823], European Research Council [101044479], Agence Nationale de la Recherche [ANR-10-LABX-62-IBEID]. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.