Active Learning for Drug Design: A Case Study on the Plasma Exposure of Orally Administered Drugs

Xiaoyu Ding; Rongrong Cui; Jie Yu; Tiantian Liu; Tingfei Zhu; Dingyan Wang; Jie Chang; Zisheng Fan; Xiaomeng Liu; Kaixian Chen; Hualiang Jiang; Xutong Li; Xiaomin Luo; Mingyue Zheng

doi:10.1021/acs.jmedchem.1c01683

J Med Chem. 2021 Nov 25;64(22):16838-16853. doi: 10.1021/acs.jmedchem.1c01683. Epub 2021 Nov 15.

Authors

Xiaoyu Ding¹,²; Rongrong Cui³; Jie Yu¹,²; Tiantian Liu¹,²; Tingfei Zhu¹,²; Dingyan Wang¹,²; Jie Chang³; Zisheng Fan³; Xiaomeng Liu¹,²; Kaixian Chen¹,²,³; Hualiang Jiang¹,²,³,⁴; Xutong Li¹,²; Xiaomin Luo¹,²; Mingyue Zheng¹,²,³

Affiliations

¹ Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China.
² University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing 100049, China.
³ School of Chinese Materia Medica, Nanjing University of Chinese Medicine, 138 Xianlin Road, Nanjing 210023, China.
⁴ School of Life Science and Technology, ShanghaiTech University, 393 Huaxiazhong Road, Shanghai 200031, China.

PMID: 34779199

Abstract

The success of artificial intelligence (AI) models has been limited by their need for large amounts of high-quality training data, which most drug discovery pipelines cannot supply. Active learning (AL) is a subfield of AI concerned with algorithms that select the data they need to improve their models. Here, we propose a two-phase AL pipeline and apply it to predicting the oral plasma exposure of drugs. In phase I, the AL-based model proved capable of sampling informative data from a noisy data set: using only 30% of the training data, it reached an accuracy of 0.856 on an independent test set. In phase II, the AL-based model explored a large, diverse chemical space (855K samples) for experimental testing and feedback. Accuracy improved and new highly confident predictions (50K samples) were obtained, which suggests that the model's applicability domain has been significantly expanded.
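
To make the idea of a model that "selects the data it needs" concrete, here is a minimal sketch of a pool-based active learning loop. It is not the authors' pipeline: the abstract does not state the query strategy, model, or molecular features, so the uncertainty-sampling rule, the toy nearest-centroid classifier, the synthetic two-class data, and all function names (`make_blobs`, `fit_centroids`, `prob_positive`, `active_learning`) are assumptions chosen for illustration. The 30% labeling budget in the demo only echoes the proportion mentioned above.

```python
import math
import random


def make_blobs(n, seed):
    """Synthetic 2-D, two-class data (hypothetical stand-in for real compounds)."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        center = 2.0 * label  # class 0 near (0, 0), class 1 near (2, 2)
        X.append([rng.gauss(center, 1.0), rng.gauss(center, 1.0)])
        y.append(label)
    return X, y


def fit_centroids(X, y):
    """Toy model: the mean feature vector of each class."""
    sums, counts = {}, {}
    for xi, yi in zip(X, y):
        acc = sums.setdefault(yi, [0.0] * len(xi))
        for j, v in enumerate(xi):
            acc[j] += v
        counts[yi] = counts.get(yi, 0) + 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def prob_positive(centroids, x):
    """Pseudo-probability of class 1 from the distances to both centroids."""
    d0 = math.dist(x, centroids[0])
    d1 = math.dist(x, centroids[1])
    return 1.0 / (1.0 + math.exp(d1 - d0))


def accuracy(centroids, X, y):
    hits = sum((prob_positive(centroids, xi) >= 0.5) == bool(yi)
               for xi, yi in zip(X, y))
    return hits / len(y)


def active_learning(X_pool, y_pool, seed_idx, budget, batch_size):
    """Label up to `budget` pool items, querying the least certain ones first.

    `seed_idx` must contain at least one example of each class.
    """
    labeled = list(seed_idx)
    seen = set(seed_idx)
    unlabeled = [i for i in range(len(X_pool)) if i not in seen]
    while len(labeled) < budget and unlabeled:
        model = fit_centroids([X_pool[i] for i in labeled],
                              [y_pool[i] for i in labeled])
        # Uncertainty sampling: predictions closest to 0.5 come first.
        unlabeled.sort(key=lambda i: abs(prob_positive(model, X_pool[i]) - 0.5))
        take = min(batch_size, budget - len(labeled))
        picked, unlabeled = unlabeled[:take], unlabeled[take:]
        labeled.extend(picked)  # the "oracle" reveals y_pool for these items
    model = fit_centroids([X_pool[i] for i in labeled],
                          [y_pool[i] for i in labeled])
    return model, labeled


if __name__ == "__main__":
    X_pool, y_pool = make_blobs(200, seed=0)
    X_test, y_test = make_blobs(100, seed=1)
    seeds = [y_pool.index(0), y_pool.index(1)]  # one labeled example per class
    budget = int(0.3 * len(X_pool))             # label 30% of the pool
    model, labeled = active_learning(X_pool, y_pool, seeds, budget, batch_size=10)
    print(f"labeled {len(labeled)} of {len(X_pool)} pool items")
    print(f"test accuracy: {accuracy(model, X_test, y_test):.3f}")
```

In a real setting the oracle step would be an experiment or a curated measurement rather than a lookup in `y_pool`, and the model would be retrained on each newly labeled batch in the same way.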

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Administration, Oral
Drug Design*
Machine Learning*
Pharmaceutical Preparations / blood*
Problem-Based Learning*

Substances

Pharmaceutical Preparations