Enhancing Molecular Property Prediction through Task-Oriented Transfer Learning: Integrating Universal Structural Insights and Domain-Specific Knowledge

J Med Chem. 2024 May 15. doi: 10.1021/acs.jmedchem.4c00692. Online ahead of print.

Abstract

Precisely predicting molecular properties is crucial in drug discovery, but the scarcity of labeled data poses a challenge for applying deep learning methods. While large-scale self-supervised pretraining has proven to be an effective solution, it often neglects domain-specific knowledge. To address this issue, we introduce Task-Oriented Multilevel Learning based on BERT (TOML-BERT), a dual-level pretraining framework that captures both the structural patterns and the domain knowledge of molecules. TOML-BERT achieved state-of-the-art prediction performance on 10 pharmaceutical datasets. It can mine contextual information within molecular structures and extract domain knowledge from massive pseudo-labeled data. The dual-level pretraining yielded significant positive transfer, with its two components making complementary contributions. Interpretive analysis showed that the effectiveness of the dual-level pretraining lies in the prior learning of a task-related molecular representation. Overall, TOML-BERT demonstrates the potential of combining multiple pretraining tasks to extract task-oriented knowledge, advancing molecular property prediction in drug discovery.