Machine Learning Prediction of Nine Molecular Properties Based on the SMILES Representation of the QM9 Quantum-Chemistry Dataset

Gabriel A Pinheiro; Johnatan Mucelini; Marinalva D Soares; Ronaldo C Prati; Juarez L F Da Silva; Marcos G Quiles

doi:10.1021/acs.jpca.0c05969

J Phys Chem A. 2020 Nov 25;124(47):9854-9866. doi: 10.1021/acs.jpca.0c05969. Epub 2020 Nov 11.

Authors

Gabriel A Pinheiro¹, Johnatan Mucelini², Marinalva D Soares³, Ronaldo C Prati⁴, Juarez L F Da Silva², Marcos G Quiles⁵

Affiliations

¹ Associate Laboratory for Computing and Applied Mathematics, National Institute for Space Research, PO Box 515, 12227-010, São José dos Campos, SP, Brazil.
² São Carlos Institute of Chemistry, University of São Paulo, PO Box 780, 13560-970, São Carlos, SP, Brazil.
³ Institute of Science and Technology, Federal University of São Paulo (Unifesp), 12247-014, São José dos Campos, SP, Brazil.
⁴ Center of Mathematics, Computation and Cognition, Federal University of ABC, Av. Dos Estados, 5001, 09210-580, Santo André, SP, Brazil.
⁵ Institute of Science and Technology, Federal University of São Paulo, 12247-014, São José dos Campos, SP, Brazil.

PMID: 33174750

Abstract

Machine learning (ML) models can potentially accelerate the discovery of tailored materials by learning a function that maps chemical compounds to their respective target properties. A crucial step in this process is encoding the molecular systems for the ML model, where the choice of molecular representation plays a central role. Most representations are based on atomic coordinates (structure); however, this can increase the computational cost of ML training and prediction. Herein, we investigate the impact of choosing coordinate-free descriptors based on the Simplified Molecular Input Line Entry System (SMILES) representation, which can substantially reduce the computational cost of ML predictions. To this end, we evaluate the prediction performance of a feed-forward neural network (FNN) model over five feature selection methods and nine ground-state properties (including energetic, electronic, and thermodynamic properties) from a public data set composed of ∼130k organic molecules. Our best results reached a mean absolute error of ∼0.05 eV, close to chemical accuracy, for the atomization energies (internal energy at 0 K, internal energy at 298.15 K, enthalpy at 298.15 K, and free energy at 298.15 K). Moreover, for the atomization energies, the out-of-sample error was nine times lower than that of the same FNN model trained with the Coulomb matrix, a traditional coordinate-based descriptor. Furthermore, our results show how the model's accuracy is limited by employing such a low-computational-cost representation, which carries less information about the molecular structure than state-of-the-art methods.
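To make the idea of a coordinate-free, SMILES-based descriptor concrete, the following is a minimal sketch in plain Python. It is not the authors' feature set or model: the token pattern, the vocabulary `VOCAB`, and the helper names `smiles_token_counts` and `mean_absolute_error` are all assumptions introduced here for illustration. The sketch turns a SMILES string into a fixed-length vector of token counts, with no 3D geometry required, and computes the mean absolute error used as the evaluation metric in the abstract.

```python
import re
from collections import Counter

# Assumed tokenizer: bracket atoms, two-letter halogens, single-letter atoms,
# ring-closure digits, and bond/branch symbols. Illustrative only.
TOKEN_RE = re.compile(r"\[[^\]]+\]|Cl|Br|[A-Za-z]|\d|[=#()+\-/\\@.%]")

# Hypothetical vocabulary covering tokens common in small organic molecules.
VOCAB = ["C", "N", "O", "F", "c", "n", "o", "=", "#", "(", ")", "1", "2"]

# Chemical accuracy is commonly taken as 1 kcal/mol, roughly 0.0434 eV.
CHEMICAL_ACCURACY_EV = 0.0434


def smiles_token_counts(smiles, vocabulary=VOCAB):
    """Return a count vector of SMILES tokens, ordered as in `vocabulary`."""
    counts = Counter(TOKEN_RE.findall(smiles))
    return [counts.get(token, 0) for token in vocabulary]


def mean_absolute_error(y_true, y_pred):
    """Mean absolute error between reference and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Acetic acid: two carbons, two oxygens, one double bond, one branch.
features = smiles_token_counts("CC(=O)O")

# Toy numbers (not results from the paper) to show the metric.
mae = mean_absolute_error([1.0, 2.0], [1.5, 1.5])
```

A vector like `features` could serve as the input to a feed-forward network, and an `mae` value in eV could then be compared against `CHEMICAL_ACCURACY_EV`. Real SMILES-derived descriptors are typically richer than raw token counts.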