Machine Learning: How Much Does It Tell about Protein Folding Rates?

Marc Corrales; Pol Cuscó; Dinara R Usmanova; Heng-Chang Chen; Natalya S Bogatyreva; Guillaume J Filion; Dmitry N Ivankov

doi:10.1371/journal.pone.0143166

PLoS One. 2015 Nov 25;10(11):e0143166. doi: 10.1371/journal.pone.0143166. eCollection 2015.

Authors

Marc Corrales^{1,2,3}, Pol Cuscó^{1,2,3}, Dinara R Usmanova^{2,4,5}, Heng-Chang Chen^{1,2,3}, Natalya S Bogatyreva^{2,4,6}, Guillaume J Filion^{1,2,3}, Dmitry N Ivankov^{2,4,6}

Affiliations

¹ Genome Architecture, Gene Regulation, Stem Cells and Cancer Programme, Centre for Genomic Regulation (CRG), Barcelona, Spain.
² Universitat Pompeu Fabra (UPF), Barcelona, Spain.
³ Genome Architecture, Gene Regulation, Stem Cells and Cancer Programme, Centre for Genomic Regulation (CRG), Barcelona, Spain.
⁴ Bioinformatics and Genomics Programme, Centre for Genomic Regulation (CRG), Barcelona, Spain.
⁵ Moscow Institute of Physics and Technology, Dolgoprudny, Moscow Region, Russia.
⁶ Laboratory of Protein Physics, Institute of Protein Research of the Russian Academy of Sciences, Pushchino, Moscow Region, Russia.

Abstract

The prediction of protein folding rates is a necessary step towards understanding the principles of protein folding. Due to the increasing amount of experimental data, numerous protein folding models and predictors of protein folding rates have been developed in the last decade. The problem has also attracted the attention of scientists from computational fields, which has led to the publication of several machine learning-based models to predict the rate of protein folding. Some of them claim to predict the logarithm of protein folding rate with an accuracy greater than 90%. However, there are reasons to believe that such claims are exaggerated due to large fluctuations and overfitting of the estimates. When we confronted three selected published models with new data, we found a much lower predictive power than reported in the original publications. Such overly optimistic estimates of predictive power arise from violations of the basic principles of machine learning. We highlight common misconceptions in the studies claiming excessive predictive power and propose using learning curves as a safeguard against those mistakes. As an example, we show that the current amount of experimental data is insufficient to build a linear predictor of logarithms of folding rates based on protein amino acid composition.
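
To illustrate the learning-curve idea mentioned in the abstract, the following is a minimal sketch and not the authors' code. It assumes a purely synthetic dataset (random 20-component "compositions" with noisy linear targets standing in for logarithms of folding rates), and all names (`make_synthetic_dataset`, `fit_linear`, `learning_curve`), the dataset size, and the noise level are hypothetical choices made for illustration. It fits a linear model on training sets of increasing size and reports the mean squared error on the training proteins and on the held-out proteins; a large gap between the two at a given size suggests that the model is overfitting and that more data are needed.

```python
import random

N_FEATURES = 20  # one fraction per standard amino acid


def make_synthetic_dataset(n_proteins, rng, noise_sd=1.0):
    """Synthetic stand-in for real data (an assumption, not experimental values)."""
    true_w = [rng.gauss(0.0, 20.0) for _ in range(N_FEATURES)]
    X, y = [], []
    for _ in range(n_proteins):
        raw = [rng.random() for _ in range(N_FEATURES)]
        total = sum(raw)
        comp = [v / total for v in raw]  # fractions sum to 1
        X.append(comp)
        y.append(sum(w * c for w, c in zip(true_w, comp)) + rng.gauss(0.0, noise_sd))
    return X, y


def fit_linear(X, y, lam=1e-6):
    """Least squares via the normal equations; a tiny ridge term keeps them solvable."""
    d = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) + (lam if i == j else 0.0) for j in range(d)]
         for i in range(d)]
    b = [sum(r[i] * t for r, t in zip(X, y)) for i in range(d)]
    # Gaussian elimination with partial pivoting.
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, d):
            f = A[r][col] / A[col][col]
            for c in range(col, d):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back substitution.
    w = [0.0] * d
    for i in reversed(range(d)):
        w[i] = (b[i] - sum(A[i][j] * w[j] for j in range(i + 1, d))) / A[i][i]
    return w


def mse(w, X, y):
    """Mean squared error of the linear model w on (X, y)."""
    return sum((sum(wi * xi for wi, xi in zip(w, row)) - t) ** 2
               for row, t in zip(X, y)) / len(y)


def learning_curve(X, y, sizes, rng, repeats=10):
    """Average train and held-out error for each training-set size."""
    curve = []
    for n in sizes:
        train_err, test_err = 0.0, 0.0
        for _ in range(repeats):
            idx = list(range(len(X)))
            rng.shuffle(idx)
            train, test = idx[:n], idx[n:]
            Xtr, ytr = [X[i] for i in train], [y[i] for i in train]
            Xte, yte = [X[i] for i in test], [y[i] for i in test]
            w = fit_linear(Xtr, ytr)
            train_err += mse(w, Xtr, ytr)
            test_err += mse(w, Xte, yte)
        curve.append((n, train_err / repeats, test_err / repeats))
    return curve


rng = random.Random(0)
X, y = make_synthetic_dataset(120, rng)  # 120 is an arbitrary illustrative size
curve = learning_curve(X, y, sizes=[25, 40, 60, 80], rng=rng)
for n, tr, te in curve:
    print(f"n_train={n:3d}  train MSE={tr:.2f}  held-out MSE={te:.2f}")
```

In a sketch like this, the two error curves are expected to approach each other as the training set grows. If they are still far apart at the largest size the data allow, the reported training performance should not be read as predictive power.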

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Machine Learning*
Protein Folding*
Proteins / chemistry*
Reproducibility of Results

Substances

Proteins

Grants and funding

NSB was supported by the Russian Science Foundation Grant 14-24-00157. DRU and DNI were supported by ERC grant 335980_EinME. The authors acknowledge support of the Spanish Ministry of Economy and Competitiveness, ‘Centro de Excelencia Severo Ochoa 2013-2017’, SEV-2012-0208. HCC and PC were supported by the Spanish Ministry of Economy and Competitiveness (including State Training Subprogram: predoctoral fellowships for the training of PhD students (FPI) 2013). MC and GF were supported by the CRG. The publication cost was covered by CRG.