Systematic investigation of machine learning on limited data: A study on predicting protein-protein binding strength

Feifan Zheng; Xin Jiang; Yuhao Wen; Yan Yang; Minghui Li

doi:10.1016/j.csbj.2023.12.018

Comput Struct Biotechnol J. 2023 Dec 20:23:460-472. doi: 10.1016/j.csbj.2023.12.018. eCollection 2024 Dec.

Authors

Feifan Zheng¹, Xin Jiang¹, Yuhao Wen¹, Yan Yang¹, Minghui Li¹

Affiliation

¹ MOE Key Laboratory of Geriatric Diseases and Immunology, School of Biology and Basic Medical Sciences, Suzhou Medical College of Soochow University, Suzhou, Jiangsu Province 215123, China.

Abstract

Applying machine learning techniques in biological research is challenging, especially when data availability is limited. In this study, we drew on recent advances in methods for predicting protein-protein binding strength to conduct a systematic investigation into the application of machine learning on limited data. Binding strength, quantitatively measured as binding affinity, is vital for understanding the processes of recognition, association, and dysfunction that occur within protein complexes. By incorporating transfer learning, integrating domain knowledge, and employing both deep learning and traditional machine learning algorithms, we mitigated the impact of data limitations and improved the prediction of protein-protein binding affinity. In particular, we developed over 20 models and ultimately selected three representative, best-performing ones that belong to distinct categories. The first model is structure-based: a random forest regressor trained on thirteen handcrafted features. The second model is sequence-based: it feeds transferred embedding features into a multilayer perceptron. The third is an ensemble model that averages the predictions of these two models. Comparison with other predictors on three independent datasets confirms the significant improvements achieved by our models in predicting protein-protein binding affinity. The programs for running these three models are available at https://github.com/minghuilab/BindPPI.

Keywords: Machine learning methods; Protein-protein binding affinity; Tools.
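
The ensemble step described in the abstract, averaging the predictions of the structure-based and sequence-based models, can be sketched as follows. This is a minimal illustration and not the BindPPI code: the function names `ensemble_average` and `pearson_r`, the unweighted mean, and the numeric values are all assumptions made for the example. The Pearson correlation is included only as one common way to compare predicted and measured affinities; the abstract does not state which metrics were used.

```python
from math import sqrt
from typing import List, Sequence


def ensemble_average(structure_preds: Sequence[float],
                     sequence_preds: Sequence[float]) -> List[float]:
    """Average two models' predictions, complex by complex.

    Assumption: a plain unweighted mean, with both lists ordered
    so that index i refers to the same protein complex.
    """
    if len(structure_preds) != len(sequence_preds):
        raise ValueError("both models must score the same complexes")
    return [(a + b) / 2.0 for a, b in zip(structure_preds, sequence_preds)]


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between predicted and measured values."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences of length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical affinity values for three complexes (illustrative only).
    structure_based = [-10.0, -8.0, -12.0]
    sequence_based = [-9.0, -9.0, -11.0]
    measured = [-9.8, -8.1, -11.9]

    combined = ensemble_average(structure_based, sequence_based)
    print("ensemble predictions:", combined)
    print("Pearson r vs. measured:", round(pearson_r(combined, measured), 3))
```

Averaging is the simplest way to combine two regressors. It tends to help when the two models make partly independent errors, which is plausible here because one model relies on structural features and the other on sequence embeddings.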