A feature transferring workflow between data-poor compounds in various tasks

Xiaofei Sun; Jingyuan Zhu; Bin Chen; Hengzhi You; Huiqing Xu

doi:10.1371/journal.pone.0266088

PLoS One. 2022 Mar 30;17(3):e0266088. doi: 10.1371/journal.pone.0266088. eCollection 2022.

Authors

Xiaofei Sun¹,², Jingyuan Zhu³, Bin Chen²,⁴, Hengzhi You³, Huiqing Xu⁵

Affiliations

¹ Chengdu Institute of Computer Application, Chinese Academy of Sciences, Chengdu, Sichuan, China.
² University of Chinese Academy of Sciences, Beijing, China.
³ School of Science, Harbin Institute of Technology, Shenzhen, Guangdong, China.
⁴ IRIAI, Harbin Institute of Technology, Shenzhen, Guangdong, China.
⁵ Guangdong Energy Group Science and Technology Research Institute Co., Ltd., Guangzhou, Guangdong, China.

Abstract

In silico compound screening helps identify high-activity lead compounds and can predict drug safety. A key challenge is that the number of accumulated observations of drug activity and toxicity varies by target across datasets, and some targets are far more understudied than others. Because drug data are both insufficient overall and imbalanced, existing models struggle to predict drug activity and toxicity accurately across multiple tasks. To address this problem, we propose a two-stage transfer learning workflow for developing a prediction model that can accurately predict drug activity and toxicity for targets with insufficient observations. We built a balanced dataset based on the Tox21 dataset and developed a drug activity and toxicity prediction model, based on Siamese networks and graph convolution, that produces multitask output. We also applied transfer learning from data-rich targets to data-poor targets. The approach showed greater accuracy in predicting the activity and toxicity of compounds for both data-rich and data-poor targets. On Tox21, a relatively rich dataset, the model reached an AUROC of 0.877 on the classification tasks. On the other five unbalanced datasets, transfer learning strategies also raised model accuracy for understudied targets. Our models can overcome the imbalance in target data and predict compound activity and toxicity for understudied targets, helping to prioritize upcoming biological experiments.
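As a reading aid for the AUROC figure quoted above, the following is a minimal sketch of how a per-task AUROC can be computed and averaged over several classification tasks when some compound/task labels are missing, as is common in multitask toxicity data. It is not the authors' evaluation code: the function names `auroc` and `mean_task_auroc`, the use of `None` for a missing label, and the unweighted mean over tasks are all assumptions made for illustration.

```python
from typing import List, Optional, Sequence


def auroc(labels: Sequence[Optional[int]], scores: Sequence[float]) -> Optional[float]:
    """AUROC as the probability that a random positive outscores a random negative.

    Ties count as half a win. Entries whose label is None (assumed here to mean
    "not measured") are ignored. Returns None if a class is absent.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def mean_task_auroc(
    label_matrix: List[List[Optional[int]]],
    score_matrix: List[List[float]],
) -> Optional[float]:
    """Unweighted mean of per-task AUROC (rows are compounds, columns are tasks).

    Tasks for which AUROC is undefined are skipped (an illustrative choice).
    """
    n_tasks = len(label_matrix[0])
    per_task = []
    for t in range(n_tasks):
        value = auroc(
            [row[t] for row in label_matrix],
            [row[t] for row in score_matrix],
        )
        if value is not None:
            per_task.append(value)
    return sum(per_task) / len(per_task) if per_task else None


if __name__ == "__main__":
    # Toy data only: 4 compounds, 2 tasks, two labels missing in task 2.
    labels = [[1, None], [0, 1], [1, 0], [0, None]]
    scores = [[0.9, 0.2], [0.1, 0.3], [0.8, 0.7], [0.4, 0.5]]
    print(mean_task_auroc(labels, scores))
```

The pairwise formulation is quadratic in the number of compounds, so it is suitable only for small illustrations; a rank-based implementation would be the usual choice for full datasets.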

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Workflow*

Grants and funding

This work is partially supported by the Shenzhen Science and Technology Research Fund (JCYJ20190806142203709; JSGG20191129114029286; JSGG20201103153807021), Talent Development Starting Fund from Shenzhen Government (HA11409030), and Guangdong Province Basic and Applied Basic Research Fund Project (2021A1515110366). There was no additional external funding received for this study. The funder provided support in the form of salaries for authors [XS, JZ, BC, HY and HX], but did not have any additional role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript. The specific roles of these authors are articulated in the ‘author contributions’ section.