The impact of cross-docked poses on performance of machine learning classifier for protein-ligand binding pose prediction

Chao Shen; Xueping Hu; Junbo Gao; Xujun Zhang; Haiyang Zhong; Zhe Wang; Lei Xu; Yu Kang; Dongsheng Cao; Tingjun Hou

doi:10.1186/s13321-021-00560-w

J Cheminform. 2021 Oct 16;13(1):81. doi: 10.1186/s13321-021-00560-w.

Authors

Chao Shen¹ ², Xueping Hu¹, Junbo Gao¹, Xujun Zhang¹, Haiyang Zhong¹, Zhe Wang¹, Lei Xu³, Yu Kang⁴, Dongsheng Cao⁵, Tingjun Hou⁶ ⁷

Affiliations

¹ Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, Zhejiang, 310058, People's Republic of China.
² State Key Lab of CAD&CG, Zhejiang University, Hangzhou, Zhejiang, 310058, People's Republic of China.
³ Institute of Bioinformatics and Medical Engineering, School of Electrical and Information Engineering, Jiangsu University of Technology, Changzhou, 213001, China.
⁴ Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, Zhejiang, 310058, People's Republic of China. yukang@zju.edu.cn.
⁵ Xiangya School of Pharmaceutical Sciences, Central South University, Changsha, Hunan, 410013, People's Republic of China. oriental-cds@163.com.
⁶ Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, Zhejiang, 310058, People's Republic of China. tingjunhou@zju.edu.cn.
⁷ State Key Lab of CAD&CG, Zhejiang University, Hangzhou, Zhejiang, 310058, People's Republic of China. tingjunhou@zju.edu.cn.

Abstract

Structure-based drug design depends on detailed knowledge of the three-dimensional (3D) structures of protein-ligand complexes, but accurate prediction of ligand binding poses remains a major challenge for molecular docking because of the deficiencies of scoring functions (SFs) and the neglect of protein flexibility upon ligand binding. In this study, using a cross-docking dataset purpose-built from the PDBbind database, we developed several XGBoost-trained classifiers to discriminate near-native binding poses from decoys, and systematically assessed their performance with and without cross-docked poses in the training and test sets. The results show that using Extended Connectivity Interaction Features (ECIF), Vina energy terms and docking pose ranks as features achieves the best performance, as judged by validation with random splitting or refined-core splitting and by testing on re-docked or cross-docked poses. Although performance decreases significantly in threefold clustered cross-validation, including the Vina energy terms effectively secures a lower bound on model performance and thus improves generalization capability. Our results also highlight the importance of incorporating cross-docked poses into training in order to obtain SFs with a wide application domain and high robustness for binding pose prediction. The source code and the newly developed cross-docking datasets are freely available at https://github.com/sc8668/ml_pose_prediction and https://zenodo.org/record/5525936, respectively, under an open-source license. We believe that our study may provide valuable guidance for the development and assessment of new machine learning-based SFs (MLSFs) for predicting protein-ligand binding poses.

Keywords: Cross-docking; Machine learning (ML); Molecular docking; Protein–ligand binding pose; Scoring function (SF).
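
The threefold clustered cross-validation mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how clusters are defined or how they are assigned to folds, so the sketch assumes a precomputed cluster label per complex and uses a simple greedy balancing rule. The function names, complex identifiers and cluster labels are all hypothetical. The point it shows is that whole clusters, rather than individual complexes, are placed in a fold, so no cluster contributes to both training and test data.

```python
from collections import defaultdict


def clustered_folds(cluster_of, n_folds=3):
    """Assign whole clusters to folds (assumed greedy rule, not from the paper).

    cluster_of maps a complex identifier to its cluster label.
    Clusters are taken largest first and placed in the currently smallest fold.
    """
    members = defaultdict(list)
    for complex_id, cluster_id in cluster_of.items():
        members[cluster_id].append(complex_id)

    folds = [[] for _ in range(n_folds)]
    # Sort by descending size, then by label, so the result is deterministic.
    ordered = sorted(members, key=lambda c: (-len(members[c]), str(c)))
    for cluster_id in ordered:
        smallest = min(range(n_folds), key=lambda i: len(folds[i]))
        folds[smallest].extend(sorted(members[cluster_id]))
    return folds


def train_test_splits(folds):
    """Yield (train, test) lists, holding out one fold at a time."""
    for i, test in enumerate(folds):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


# Hypothetical toy data: identifiers and cluster labels are made up.
cluster_of = {
    "complex_01": "cluster_A", "complex_02": "cluster_A", "complex_03": "cluster_A",
    "complex_04": "cluster_B", "complex_05": "cluster_B",
    "complex_06": "cluster_C",
    "complex_07": "cluster_D",
}

folds = clustered_folds(cluster_of, n_folds=3)
for train, test in train_test_splits(folds):
    train_clusters = {cluster_of[x] for x in train}
    test_clusters = {cluster_of[x] for x in test}
    print(sorted(test_clusters), "held out; overlap:", train_clusters & test_clusters)
```

Because similar complexes stay together, a model is always tested on clusters it has not seen in training. This is a stricter setting than random splitting and is consistent with the performance drop the abstract reports for clustered cross-validation.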
