Improving the Accuracy of Physics-Based Hydration Free Energy Predictions by Machine Learning the Remaining Error Relative to the Experiment

J Chem Theory Comput. 2024 Jan 9;20(1):396-410. doi: 10.1021/acs.jctc.3c00981. Epub 2023 Dec 27.

Abstract

The accuracy of computational models of water is key to atomistic simulations of biomolecules. We propose a computationally efficient way to improve the accuracy of predicted hydration free energies (HFEs) of small molecules: the remaining errors of the physics-based models relative to experiment are predicted and mitigated by machine learning (ML) as a postprocessing step. Specifically, a trained graph convolutional neural network attempts to identify "blind spots" in the physics-based predictions, where the complex physics of aqueous solvation is poorly accounted for, and partially corrects for them. The strategy is explored for five classical solvent models representing various accuracy/speed trade-offs, from the fast analytical generalized Born (GB) to the popular TIP3P explicit solvent model; experimental HFEs of small neutral molecules from the FreeSolv set are used for training and testing. For all five models, the ML correction reduces the root-mean-square error (RMSE) relative to experiment for HFEs of small molecules, without significant overfitting and with negligible computational overhead. For example, on the test set, the relative accuracy improvement is 47% for the fast analytical GB, making the ML-corrected GB almost as accurate as uncorrected TIP3P. For the TIP3P model, the accuracy improvement is about 39%, bringing the RMSE of the ML-corrected model below the 1 kcal/mol threshold. In general, the relative benefit of the ML correction is smaller for more accurate physics-based models, bottoming out at roughly a 20% accuracy gain over the physics-based treatment alone. The proposed strategy of using ML to learn the remaining error of a physics-based model offers a distinct advantage over training ML directly on reference HFEs alone: it preserves the correct overall trend, even well outside the training set.
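To make the postprocessing idea concrete, below is a minimal, self-contained sketch of the delta-learning scheme the abstract describes: an ML regressor is trained on the residual between experimental and physics-based HFEs, and its output is added to the physics-based prediction at inference time. This is only an illustration of the general strategy, not the paper's implementation: a generic scikit-learn random forest stands in for the paper's graph convolutional network, and all molecules, feature vectors, and noise levels are synthetic assumptions rather than FreeSolv data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-ins: ~600 molecules (FreeSolv scale), 8 molecular descriptors.
n_mols, n_feats = 600, 8
X = rng.normal(size=(n_mols, n_feats))            # per-molecule descriptors
hfe_exp = rng.normal(-5.0, 3.0, size=n_mols)      # "experimental" HFEs, kcal/mol

# Mimic a physics-based model whose error has a systematic, structure-dependent
# component (the "blind spots") plus random noise.
systematic = 0.6 * (X @ rng.normal(size=n_feats))
hfe_phys = hfe_exp - systematic + rng.normal(0.0, 0.5, size=n_mols)

# Delta-learning target: the residual of the physics model relative to experiment.
residual = hfe_exp - hfe_phys

X_tr, X_te, r_tr, r_te, p_tr, p_te, e_tr, e_te = train_test_split(
    X, residual, hfe_phys, hfe_exp, test_size=0.2, random_state=0)

# Generic regressor standing in for the paper's graph convolutional network.
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_tr, r_tr)

# Postprocessing step: corrected HFE = physics prediction + learned correction.
hfe_corr = p_te + model.predict(X_te)

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

print(f"RMSE, physics only : {rmse(p_te, e_te):.2f} kcal/mol")
print(f"RMSE, ML-corrected : {rmse(hfe_corr, e_te):.2f} kcal/mol")
```

Note the design choice this sketch highlights: because the regressor learns only the residual, a predicted correction of zero falls back to the unmodified physics-based HFE, which is the mechanism by which the overall physical trend is preserved outside the training set, in contrast to an ML model trained directly on HFEs.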