A systematic analysis of regression models for protein engineering

Richard Michael; Jacob Kæstel-Hansen; Peter Mørch Groth; Simon Bartels; Jesper Salomon; Pengfei Tian; Nikos S Hatzakis; Wouter Boomsma

doi:10.1371/journal.pcbi.1012061

PLoS Comput Biol. 2024 May 3;20(5):e1012061. doi: 10.1371/journal.pcbi.1012061. eCollection 2024 May.

Authors

Richard Michael¹, Jacob Kæstel-Hansen², Peter Mørch Groth¹ ³, Simon Bartels¹, Jesper Salomon³, Pengfei Tian³, Nikos S Hatzakis², Wouter Boomsma¹

Affiliations

¹ Department of Computer Science, University of Copenhagen, Copenhagen, Denmark.
² Department of Chemistry, University of Copenhagen, Copenhagen, Denmark.
³ Enzyme Research, Novozymes A/S, Kongens Lyngby, Denmark.

Abstract

Optimizing proteins for particular traits holds great promise for industrial and pharmaceutical purposes. Machine learning is increasingly applied in this field to predict properties of proteins, thereby guiding the experimental optimization process. A natural question is: how much progress are we making with such predictions, and how important is the choice of regressor and representation? In this paper, we demonstrate that different assessment criteria for regressor performance can lead to dramatically different conclusions, depending on the choice of metric and on how one defines generalization. We highlight the fundamental issue of sample bias in typical regression scenarios and how it can lead to misleading conclusions about regressor performance. Finally, we make the case for the importance of calibrated uncertainty in this domain.
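As a rough illustration of how the choice of metric can reverse a comparison between regressors, the following minimal sketch scores two hypothetical sets of predictions with mean squared error and with Spearman rank correlation. The data, the two "models", and the helper names (`mse`, `ranks`, `spearman`) are assumptions made up for this example. They are not taken from the paper and do not reproduce its experiments.

```python
from statistics import mean

def mse(y_true, y_pred):
    """Mean squared error between targets and predictions."""
    return mean((t - p) ** 2 for t, p in zip(y_true, y_pred))

def ranks(values):
    """Rank of each value (1 = smallest). Assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman(y_true, y_pred):
    """Spearman rank correlation via the no-ties formula 1 - 6*sum(d^2)/(n(n^2-1))."""
    n = len(y_true)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(y_true), ranks(y_pred)))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical measured property values for five protein variants.
y_true = [1, 2, 3, 4, 5]

# Hypothetical model A: ranks every variant correctly but is badly mis-scaled.
pred_a = [10, 20, 30, 40, 50]

# Hypothetical model B: numerically close to the targets but swaps neighbours.
pred_b = [2, 1, 4, 3, 5]

for name, pred in [("A", pred_a), ("B", pred_b)]:
    print(name, "MSE:", mse(y_true, pred), "Spearman:", spearman(y_true, pred))
```

In this toy setting model B has the lower error while model A has the higher rank correlation, so which model looks better depends entirely on the metric chosen.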

Copyright: © 2024 Michael et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

MeSH terms

Algorithms
Computational Biology* / methods
Machine Learning*
Protein Engineering* / methods
Proteins / chemistry
Regression Analysis

Grants and funding

This work was supported in part by the Danish Data Science Academy (to RM, ddsa.dk, DDSA-PhD-2022-010, which is funded by the Novo Nordisk Foundation, NNF21SA0069429, novonordiskfonden.dk, and VILLUM FONDEN, 40516, veluxfoundations.dk). Further funding includes the NNF Center for 4D cellular dynamics (to NSH, NNF22OC0075851, novonordiskfonden.dk) and Villum Synergy (to NSH and WB, veluxfoundations.dk, DeepDesign 40578), the Innovation Fund Denmark (to WB and PMG, innovationsfonden.dk, 1044-00158A), the MLLS Center (Basic Machine Learning Research in Life Science, novonordiskfonden.dk, NNF20OC0062606), Digital Pilot Hub (to SB, Skylab Digital, Danish Ministry of Education and Science), and the Pioneer Centre for AI (to RM, PMG, SB, WB, Danish National Research Foundation, dg.dk, grant number P1). The funders played no role in study design, data collection, analysis, decision to publish, or preparation of the manuscript.