Powerful molecule generation with simple ConvNet

Hongyang K Yu; Hongjiang C Yu

doi:10.1093/bioinformatics/btac332

Bioinformatics. 2022 Jun 27;38(13):3438-3443. doi: 10.1093/bioinformatics/btac332.

Authors

Hongyang K Yu¹, Hongjiang C Yu¹

Affiliation

¹ AI Drug Discovery, Anticancer Bioscience Ltd, Chengdu, China.

PMID: 35595245

Abstract

Motivation: Automated molecule generation is a crucial step in in-silico drug discovery. Graph-based generation algorithms have seen significant progress over recent years. However, they are often complex to implement, hard to train, and prone to under-performing when generating long-sequence molecules. A simple yet powerful alternative would make automated drug discovery methods more practical.

Results: We propose a ConvNet-based sequential graph generation algorithm that reformulates molecular graph generation as a sequence of simple classification tasks. At each step, a convolutional neural network takes the sub-graph generated at the previous step and classifies which atom- or bond-adding action should extend it. The model is pretrained by learning to sequentially reconstruct existing molecules; the pretrained model is called SEEM (structural encoder for engineering molecules). It is then fine-tuned with reinforcement learning to generate molecules with improved properties; the fine-tuned model is called SEED (structural encoder for engineering drug-like-molecules). The proposed models perform competitively against 16 state-of-the-art baselines on three benchmark datasets.
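
To make the reformulation concrete, the sketch below shows one way a known molecule could be unrolled into (sub-graph, next action) pairs, which is the kind of supervised classification data that sequential reconstruction implies. It is illustrative only: the abstract does not specify the state encoding, the action set, or the atom ordering used by SEEM, so the names `decompose` and `replay`, the tuple-based actions, and the "add an atom, then its bonds to earlier atoms" order are all assumptions.

```python
from typing import Dict, List, Tuple

# Hypothetical action encoding (an assumption, not the paper's):
#   ("add_atom", symbol) | ("add_bond", i, j, order) | ("stop",)
Action = Tuple
Bonds = Dict[Tuple[int, int], int]  # keys are (i, j) with i < j, value is bond order


def decompose(atoms: List[str], bonds: Bonds) -> List[Tuple[tuple, Action]]:
    """Unroll one molecule into (sub-graph snapshot, next action) training pairs."""
    pairs = []
    cur_atoms: List[str] = []
    cur_bonds: Bonds = {}

    def snapshot() -> tuple:
        # Immutable copy of the partial graph that a classifier would see as input.
        return (tuple(cur_atoms), tuple(sorted(cur_bonds.items())))

    for j, symbol in enumerate(atoms):
        # Step 1: the label for the current sub-graph is "add this atom".
        pairs.append((snapshot(), ("add_atom", symbol)))
        cur_atoms.append(symbol)
        # Step 2: connect the new atom to every earlier atom it is bonded to.
        for i in range(j):
            order = bonds.get((i, j), 0)
            if order:
                pairs.append((snapshot(), ("add_bond", i, j, order)))
                cur_bonds[(i, j)] = order
    # Final label: the molecule is complete.
    pairs.append((snapshot(), ("stop",)))
    return pairs


def replay(actions: List[Action]) -> Tuple[List[str], Bonds]:
    """Rebuild a graph by applying actions in order, as a generator would."""
    atoms: List[str] = []
    bonds: Bonds = {}
    for action in actions:
        if action[0] == "add_atom":
            atoms.append(action[1])
        elif action[0] == "add_bond":
            bonds[(action[1], action[2])] = action[3]
        elif action[0] == "stop":
            break
    return atoms, bonds


# Ethanol heavy-atom graph: C-C-O with two single bonds.
ethanol_atoms = ["C", "C", "O"]
ethanol_bonds = {(0, 1): 1, (1, 2): 1}
training_pairs = decompose(ethanol_atoms, ethanol_bonds)
rebuilt = replay([action for _, action in training_pairs])
```

In this framing, pretraining would fit a classifier to predict each action from its snapshot, and generation would replace the recorded actions with the classifier's own predictions until it emits the stop action.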

Availability and implementation: Code is available at https://github.com/yuh8/SEEM and https://github.com/yuh8/SEED. The QM9 dataset is available at http://quantum-machine.org/datasets/, the ZINC250k dataset at https://raw.githubusercontent.com/aspuru-guzik-group/chemical_vae/master/models/zinc_properties/250k_rndm_zinc_drugs_clean_3.csv, and the ChEMBL dataset at https://www.ebi.ac.uk/chembl/.

Supplementary information: Supplementary data are available at Bioinformatics online.

MeSH terms

Algorithms*
Drug Discovery
Neural Networks, Computer*