Interpretable Molecular Property Predictions Using Marginalized Graph Kernels

J Chem Inf Model. 2023 Aug 14;63(15):4633-4640. doi: 10.1021/acs.jcim.3c00396. Epub 2023 Jul 28.

Abstract

Marginalized graph kernels have shown competitive performance in molecular machine learning tasks but currently lack measures of interpretability, which are important to improve trust in the models, detect biases, and inform molecular optimization campaigns. Here, we conceive and implement two interpretability measures for Gaussian process regression using a marginalized graph kernel (GPR-MGK) that quantify (1) the contribution of specific training data to the prediction and (2) the contribution of specific nodes of the graph to the prediction. We demonstrate the applicability of these interpretability measures for molecular property prediction. We compare GPR-MGK to graph neural networks on four logic and two real-world toxicology data sets and find that the atomic attribution of GPR-MGK generally outperforms that of graph neural networks. We also perform a detailed molecular attribution analysis using the FreeSolv data set, showing how molecules in the training set influence machine learning predictions and why Morgan fingerprints perform poorly on this data set. This is the first systematic examination of the interpretability of GPR-MGK and thus an important step in the further maturation of marginalized graph kernel methods for interpretable molecular predictions.
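
To make the first measure concrete: the GPR posterior mean is a weighted sum of kernel evaluations against the training set, so the prediction decomposes naturally into one term per training molecule. The sketch below illustrates this decomposition with NumPy under the standard GPR formulation; the kernel values, targets, and noise level are placeholder assumptions standing in for a marginalized graph kernel evaluated on real molecular graphs, not the paper's actual implementation.

```python
import numpy as np

# Placeholder kernel values; in practice these would come from a
# marginalized graph kernel evaluated on molecular graphs.
K_train = np.array([[1.0, 0.3, 0.1],
                    [0.3, 1.0, 0.2],
                    [0.1, 0.2, 1.0]])   # kernel matrix over 3 training molecules
k_star  = np.array([0.8, 0.4, 0.1])     # kernel between query molecule and training set
y_train = np.array([-2.5, -0.7, 1.3])   # training targets (e.g., hydration free energies)
noise   = 1e-2                          # assumed observation-noise variance

# GPR posterior mean: y* = k*^T (K + sigma^2 I)^{-1} y
alpha = np.linalg.solve(K_train + noise * np.eye(len(y_train)), y_train)

# Per-training-molecule contributions: the prediction is their sum, so each
# term attributes part of the prediction to a single training molecule.
contributions = k_star * alpha
prediction = contributions.sum()

print(contributions)   # contribution of each training molecule
print(prediction)      # equals k_star @ alpha
```

Inspecting `contributions` shows which training molecules pull the prediction up or down, which is the kind of molecular attribution the abstract describes for the FreeSolv analysis.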