Efficient Exploration of Chemical Compound Space Using Active Learning for Prediction of Thermodynamic Properties of Alkane Molecules

J Chem Inf Model. 2023 Nov 13;63(21):6515-6524. doi: 10.1021/acs.jcim.3c01430. Epub 2023 Oct 19.

Abstract

We introduce an exploratory active learning (AL) algorithm using Gaussian process regression and marginalized graph kernel (GPR-MGK) to sample chemical compound space (CCS) at minimal cost. Targeting 251,728 enumerated alkane molecules with 4-19 carbon atoms, we applied the AL algorithm to select a diverse and representative set of molecules and then conducted high-throughput molecular simulations on these selected molecules. To demonstrate the power of the AL algorithm, we built directed message-passing neural networks (D-MPNN) using simulation data as the training set to predict liquid densities, heat capacities, and vaporization enthalpies of the CCS. Validations show that D-MPNN models built on the smallest training set considered in this work, which consists of 313 molecules or 0.124% of the original CCS, predict the properties with R2 > 0.99 against the computational data and R2 > 0.94 against the experimental data. The advantage of the presented AL algorithm is that the predicted uncertainty of GPR depends on only the molecular structures, which renders it compatible with high-throughput data generation.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Alkanes*
  • Molecular Structure
  • Neural Networks, Computer*
  • Thermodynamics

Substances

  • Alkanes