Transformer-based multitask learning for reaction prediction under low-resource circumstances

RSC Adv. 2022 Nov 8;12(49):32020-32026. doi: 10.1039/d2ra05349g. eCollection 2022 Nov 3.

Abstract

Recently, effective and rapid deep-learning methods for predicting chemical reactions have significantly aided research and development in organic chemistry and drug discovery. However, despite the exceptional performance of deep learning in retrosynthesis and synthesis, computer-assisted predictions on low-resource chemical datasets generally suffer from low accuracy because the relevant reaction data are insufficient. To address this issue, we introduce two multitask models: the retro-forward reaction prediction transformer (RFRPT) and the multiforward reaction prediction transformer (MFRPT). These models integrate multitask learning with the transformer architecture to handle both forward reaction prediction and retrosynthesis on low-resource reactions. Our results demonstrate that introducing multitask learning significantly improves average top-1 accuracy: RFRPT (76.9%) and MFRPT (79.8%) both outperform the transformer baseline (69.9%). These results also show that a multitask framework can capture sufficient chemical knowledge and effectively mitigate the impact of data scarcity on reaction prediction tasks. Both RFRPT and MFRPT substantially improve the predictive performance of transformer models and offer a powerful way to overcome the restrictions of limited training data.
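
The abstract does not specify the implementation, but one common way to realize this kind of multitask seq2seq setup is to share a single transformer across tasks and prepend a task token to the source sequence. The sketch below illustrates that idea under stated assumptions: the task-token convention, the character-level SMILES vocabulary, the toy esterification pair, and all names (TASK_TOKENS, MultitaskReactionTransformer) are illustrative choices, not the paper's actual RFRPT/MFRPT code.

```python
# Minimal sketch: one shared transformer trained jointly on forward prediction
# (reactants -> product) and retrosynthesis (product -> reactants), with the
# task signaled by a prefix token. Hypothetical, not the paper's implementation.
import torch
import torch.nn as nn

TASK_TOKENS = {"forward": "[FWD]", "retro": "[RET]"}  # assumed convention

# Toy (reactants, product) pair; real data would be SMILES reaction datasets.
pairs = [("CCO.CC(=O)O", "CC(=O)OCC")]  # esterification, for illustration

# Character vocabulary plus special tokens.
specials = ["<pad>", "<bos>", "<eos>"] + list(TASK_TOKENS.values())
chars = sorted({c for r, p in pairs for c in r + p})
vocab = {tok: i for i, tok in enumerate(specials + chars)}

def encode(smiles, task=None):
    """Map a SMILES string to token ids, optionally prefixing a task token."""
    ids = [vocab["<bos>"]]
    if task is not None:
        ids.append(vocab[TASK_TOKENS[task]])
    ids += [vocab[c] for c in smiles]
    ids.append(vocab["<eos>"])
    return torch.tensor(ids)

class MultitaskReactionTransformer(nn.Module):
    """Shared encoder-decoder; positional encodings omitted for brevity."""
    def __init__(self, vocab_size, d_model=64, nhead=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=2, num_decoder_layers=2,
            dim_feedforward=128, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, src, tgt):
        # Causal mask so each target position sees only earlier tokens.
        tgt_mask = self.transformer.generate_square_subsequent_mask(tgt.size(1))
        h = self.transformer(self.embed(src), self.embed(tgt), tgt_mask=tgt_mask)
        return self.out(h)

model = MultitaskReactionTransformer(len(vocab))
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss(ignore_index=vocab["<pad>"])

# One shared-parameter step mixing both tasks into the same batch.
reactants, product = pairs[0]
batch = [(encode(reactants, "forward"), encode(product)),
         (encode(product, "retro"), encode(reactants))]
src = nn.utils.rnn.pad_sequence([s for s, _ in batch], batch_first=True,
                                padding_value=vocab["<pad>"])
tgt = nn.utils.rnn.pad_sequence([t for _, t in batch], batch_first=True,
                                padding_value=vocab["<pad>"])
opt.zero_grad()
logits = model(src, tgt[:, :-1])                     # teacher forcing
loss = loss_fn(logits.reshape(-1, len(vocab)), tgt[:, 1:].reshape(-1))
loss.backward()
opt.step()
print(f"multitask step loss: {loss.item():.3f}")
```

The design intuition matches the abstract's claim: because both directions of the reaction share one set of transformer parameters, chemical knowledge learned from the better-resourced task can transfer to the low-resource one, rather than each task training on its limited data alone.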