Molecular language models: RNNs or transformer?

Brief Funct Genomics. 2023 Jul 17;22(4):392-400. doi: 10.1093/bfgp/elad012.

Abstract

Language models have shown the capacity to learn complex molecular distributions. In the field of molecular generation, they are designed to explore the distribution of molecules, and previous studies have demonstrated their ability to learn molecular sequences. Early on, recurrent neural networks (RNNs) were widely used for feature extraction from sequence data and were applied to various molecule generation tasks. In recent years, the attention mechanism for sequence data has become popular; it captures the underlying relationships between tokens and is widely applied in language models. The Transformer-Layer, a model based on a self-attention mechanism, performs comparably to RNN-based models. In this research, we investigated how RNNs and the Transformer-Layer differ when learning more complex distributions of molecules. For this purpose, we experimented with three generative tasks: distributions of molecules with elevated penalized LogP scores, multimodal distributions of molecules and the largest molecules in PubChem. We evaluated the models on molecular properties, basic metrics, Tanimoto similarity and related criteria. In addition, we applied two different molecular representations, SMILES and SELFIES. The results show that both language models can learn complex molecular distributions and that the SMILES-based representation performs better than SELFIES. The choice between RNNs and the Transformer-Layer should be based on the characteristics of the dataset: RNNs work better on data dominated by local features but degrade on multimodal data, whereas the Transformer-Layer is more suitable for molecules with larger molecular weights and for data dominated by global features.
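The abstract mentions Tanimoto similarity as one of the evaluation metrics. As a minimal illustrative sketch (not taken from the paper, which would typically compute this over RDKit-style binary fingerprints), the metric can be shown on fingerprints represented as sets of "on" bit indices; the fingerprint values below are made-up examples.

```python
# Sketch of Tanimoto similarity between two binary molecular fingerprints,
# each represented here as a set of "on" bit indices. The fingerprints are
# hypothetical; real evaluations would derive them from molecules (e.g. with
# a cheminformatics toolkit such as RDKit).

def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto (Jaccard) similarity: |A intersect B| / |A union B|."""
    if not fp_a and not fp_b:
        return 1.0  # convention: two empty fingerprints are identical
    return len(fp_a & fp_b) / len(fp_a | fp_b)

# Hypothetical fingerprints for a generated molecule and a reference molecule
generated = {1, 4, 9, 16, 25}
reference = {1, 4, 9, 36, 49}

print(tanimoto(generated, reference))  # 3 shared bits / 7 total bits -> ~0.429
```

A score of 1.0 indicates identical fingerprints and 0.0 indicates no shared bits, which is why the metric is a natural check of how closely generated molecules track a target distribution.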

Keywords: drug generation; language model; molecular; recurrent neural networks; transformer.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Language*
  • Neural Networks, Computer*