MultiGran-SMILES: multi-granularity SMILES learning for molecular property prediction

Bioinformatics. 2022 Sep 30;38(19):4573-4580. doi: 10.1093/bioinformatics/btac550.

Abstract

Motivation: Extracting useful molecular features is essential for molecular property prediction. Atom-level representation is a common representation of molecules, ignoring the sub-structure or branch information of molecules to some extent; however, it is vice versa for the substring-level representation. Both atom-level and substring-level representations may lose the neighborhood or spatial information of molecules. While molecular graph representation aggregating the neighborhood information of a molecule has a weak ability in expressing the chiral molecules or symmetrical structure. In this article, we aim to make use of the advantages of representations in different granularities simultaneously for molecular property prediction. To this end, we propose a fusion model named MultiGran-SMILES, which integrates the molecular features of atoms, sub-structures and graphs from the input. Compared with the single granularity representation of molecules, our method leverages the advantages of various granularity representations simultaneously and adjusts the contribution of each type of representation adaptively for molecular property prediction.

Results: The experimental results show that our MultiGran-SMILES method achieves state-of-the-art performance on BBBP, LogP, HIV and ClinTox datasets. For the BACE, FDA and Tox21 datasets, the results are comparable with the state-of-the-art models. Moreover, the experimental results show that the gains of our proposed method are bigger for the molecules with obvious functional groups or branches.

Availability and implementation: The code and data underlying this work are available on GitHub at https://github. com/Jiangjing0122/MultiGran.

Supplementary information: Supplementary data are available at Bioinformatics online.

Publication types

  • Research Support, Non-U.S. Gov't