Using Active Learning to Develop Machine Learning Models for Reaction Yield Prediction

Simon Viet Johansson; Hampus Gummesson Svensson; Esben Bjerrum; Alexander Schliep; Morteza Haghir Chehreghani; Christian Tyrchan; Ola Engkvist

doi:10.1002/minf.202200043

Mol Inform. 2022 Dec;41(12):e2200043. doi: 10.1002/minf.202200043. Epub 2022 Jul 14.

Authors

Simon Viet Johansson¹ ², Hampus Gummesson Svensson¹ ², Esben Bjerrum¹, Alexander Schliep², Morteza Haghir Chehreghani², Christian Tyrchan³, Ola Engkvist¹ ²

Affiliations

¹ Molecular AI, Discovery Sciences, R&D, AstraZeneca, SE-431 83, Mölndal, Sweden.
² Department of Computer Science and Engineering, Chalmers University of Technology and University of Gothenburg, SE-412 96, Göteborg, Sweden.
³ Medicinal Chemistry, Research and Early Development, Respiratory and Immunology (R&I), BioPharmaceuticals R&D, AstraZeneca, SE-431 83, Mölndal, Sweden.

PMID: 35732584
DOI: 10.1002/minf.202200043

Abstract

Computer-aided synthesis planning, suggesting synthetic routes for molecules of interest, is a rapidly growing field. The machine learning methods used are often dependent on access to large datasets for training, but finite experimental budgets limit how much data can be obtained from experiments. This suggests the use of schemes for data collection such as active learning, which identifies the data points of highest impact for model accuracy and which recent studies have used successfully. However, little has been done to explore the robustness of the methods predicting reaction yield when used together with active learning to reduce the amount of experimental data needed for training. This study aims to investigate the influence of machine learning algorithms and the number of initial data points on reaction yield prediction for two public high-throughput experimentation datasets. Our results show that active learning based on output margin reached a pre-defined AUROC faster than random sampling on both datasets. Analysis of feature importance of the trained machine learning models suggests active learning had a larger influence on the model accuracy when only a few features were important for the model prediction.
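
As a rough illustration of the two ingredients named in the abstract, the sketch below shows margin-based selection from an unlabeled pool next to a random-sampling baseline, plus a pairwise AUROC that could serve as a stopping criterion. It is a minimal sketch under stated assumptions, not the authors' implementation: the margin is taken to be the gap between the two class probabilities of a binary classifier (the abstract does not give the exact definition), and all function names and numbers are hypothetical.

```python
import random


def margin(prob_positive):
    """Assumed output margin for a binary classifier: the gap between
    the predicted probabilities of the two classes."""
    return abs(prob_positive - (1.0 - prob_positive))


def select_by_margin(pool_probs, batch_size):
    """Indices of the pool points the model is least certain about
    (smallest margin first)."""
    ranked = sorted(range(len(pool_probs)), key=lambda i: margin(pool_probs[i]))
    return ranked[:batch_size]


def select_at_random(pool_size, batch_size, rng):
    """Baseline: draw the next batch uniformly from the pool."""
    return rng.sample(range(pool_size), batch_size)


def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked
    correctly, counting ties as half. Quadratic, fine for small sets."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical predicted probabilities of "high yield" for five untested reactions.
pool_probs = [0.95, 0.52, 0.10, 0.45, 0.80]
picked = select_by_margin(pool_probs, batch_size=2)
baseline = select_at_random(len(pool_probs), batch_size=2, rng=random.Random(0))

# Hypothetical held-out labels and scores for checking a target AUROC.
score = auroc([1, 0, 1, 0], [0.9, 0.1, 0.4, 0.6])
```

In a full loop, the selected reactions would be run experimentally, added to the training set, and the model retrained, repeating until the held-out AUROC reaches the chosen target. The number of experiments needed can then be compared between the two selection strategies.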

Keywords: Active Learning; Bayesian Matrix Factorization; Neural Networks; Random Forest; Reaction Yield Prediction.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Machine Learning*