An ensemble-based approach to estimate confidence of predicted protein-ligand binding affinity values

Mol Inform. 2024 Apr;43(4):e202300292. doi: 10.1002/minf.202300292. Epub 2024 Feb 15.

Abstract

When designing a machine learning-based scoring function, we have access to only a limited number of protein-ligand complexes with experimentally determined binding affinity values, which represent only a fraction of all possible protein-ligand complexes. Consequently, it is crucial to report a measure of confidence and to quantify the uncertainty of the model's predictions at test time. Here, we adopt the conformal prediction technique to evaluate the confidence of the prediction for each member of the core set of the CASF 2016 benchmark. The conformal prediction technique requires a diverse ensemble of predictors for uncertainty estimation. To this end, we introduce ENS-Score, an ensemble predictor that comprises 30 models with different protein-ligand representation approaches and achieves a Pearson's correlation of 0.842 on the core set of the CASF 2016 benchmark. We also comprehensively investigate the residual error of each data point to assess the normality of the residual error distribution and the correlation of the residuals with structural features of the ligands, such as hydrophobic interactions and halogen bonding. Finally, we provide a locally hosted web application to facilitate the use of ENS-Score. All code needed to reproduce the results is provided at https://github.com/miladrayka/ENS_Score.
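To make the idea of ensemble-based conformal prediction concrete, the following is a minimal sketch and not the ENS-Score implementation. The abstract does not specify the exact procedure, so the sketch assumes a common variant, normalized split conformal prediction, in which the spread of the ensemble's predictions scales the width of each interval. The function name `ensemble_conformal_intervals`, the calibration/test split, the `alpha` and `eps` parameters, and the synthetic data are all hypothetical and chosen only for illustration.

```python
import numpy as np


def ensemble_conformal_intervals(cal_preds, cal_true, test_preds, alpha=0.1, eps=1e-6):
    """Normalized split conformal intervals from an ensemble (illustrative sketch).

    cal_preds:  array of shape (n_models, n_cal), ensemble predictions on a calibration set
    cal_true:   array of shape (n_cal,), measured values for the calibration set
    test_preds: array of shape (n_models, n_test), ensemble predictions on test complexes
    alpha:      target miscoverage rate (0.1 aims for roughly 90% coverage)
    """
    cal_preds = np.asarray(cal_preds, dtype=float)
    cal_true = np.asarray(cal_true, dtype=float)
    test_preds = np.asarray(test_preds, dtype=float)

    # Ensemble mean is the point prediction; ensemble spread is the difficulty estimate.
    cal_mean = cal_preds.mean(axis=0)
    cal_std = cal_preds.std(axis=0) + eps  # eps avoids division by zero

    # Nonconformity score: absolute residual scaled by the ensemble spread.
    scores = np.abs(cal_true - cal_mean) / cal_std

    # Finite-sample conformal quantile: the k-th smallest calibration score.
    n = scores.size
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:
        raise ValueError("calibration set too small for the requested alpha")
    q = np.sort(scores)[k - 1]

    test_mean = test_preds.mean(axis=0)
    test_std = test_preds.std(axis=0) + eps
    return test_mean, test_mean - q * test_std, test_mean + q * test_std


if __name__ == "__main__":
    # Synthetic stand-in data; NOT values from the paper or from CASF 2016.
    rng = np.random.default_rng(0)
    n_models, n_cal, n_test = 30, 200, 5
    cal_true = rng.uniform(2.0, 11.0, size=n_cal)
    cal_preds = cal_true + rng.normal(0.0, 1.2, size=(n_models, n_cal))
    test_preds = rng.uniform(2.0, 11.0, size=n_test) + rng.normal(0.0, 1.2, size=(n_models, n_test))

    mean, lower, upper = ensemble_conformal_intervals(cal_preds, cal_true, test_preds)
    for m, lo, hi in zip(mean, lower, upper):
        print(f"prediction {m:.2f}, interval [{lo:.2f}, {hi:.2f}]")
```

In this sketch, a complex on which the ensemble members disagree strongly receives a wider interval, which is one way a diverse ensemble can translate into a per-prediction confidence estimate.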

Keywords: PDBbind; conformal prediction; ensemble learning; molecular docking; protein-ligand binding affinity; scoring function; uncertainty quantification.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Ligands
  • Machine Learning*
  • Protein Binding*
  • Proteins* / chemistry
  • Proteins* / metabolism

Substances

  • Ligands
  • Proteins