Extracting Predictive Representations from Hundreds of Millions of Molecules

J Phys Chem Lett. 2021 Nov 11;12(44):10793-10801. doi: 10.1021/acs.jpclett.1c03058. Epub 2021 Nov 1.

Abstract

Constructing appropriate representations remains essential for molecular prediction because of the intrinsic complexity of molecules. Moreover, generating labeled data for supervised learning in the molecular sciences is often expensive and ethically constrained, which leads to small, diverse data sets that are challenging to model. In this work, we develop a self-supervised learning approach that pretrains models on more than 700 million unlabeled molecules drawn from multiple databases. The intrinsic chemical logic learned in this way enables the extraction of predictive representations from task-specific molecular sequences through a fine-tuning process. To assess the importance of self-supervised learning from unlabeled molecules, we assemble three models pretrained on different combinations of databases. We further propose a protocol, based on data traits, to automatically select the optimal model for a specific task. To validate the proposed method, we consider 10 benchmarks and 38 virtual screening data sets; across these tasks, the method consistently delivers strong predictive performance.
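
The abstract does not specify the model architecture or pretraining objective. As a minimal sketch only, assuming a SMILES-based Transformer encoder trained with masked-token prediction (a common realization of self-supervised pretraining on molecular sequences) followed by fine-tuning of a property head, the workflow could look like the following; the tokenizer, layer sizes, masking rate, and data here are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: masked-token pretraining on unlabeled SMILES, then
# fine-tuning a task head on a small labeled set. Not the authors' code.
import torch
import torch.nn as nn

VOCAB = ["<pad>", "<mask>"] + list("CNOSPFIBrcl()[]=#@+-1234567890")
STOI = {ch: i for i, ch in enumerate(VOCAB)}
PAD, MASK = STOI["<pad>"], STOI["<mask>"]

def encode(smiles, max_len=64):
    """Character-level tokenization (a simplification of real SMILES tokenizers)."""
    ids = [STOI.get(ch, PAD) for ch in smiles][:max_len]
    return ids + [PAD] * (max_len - len(ids))

class SmilesEncoder(nn.Module):
    """Small Transformer encoder with a masked-token head and a task head."""
    def __init__(self, d_model=128, nhead=4, nlayers=4):
        super().__init__()
        self.emb = nn.Embedding(len(VOCAB), d_model, padding_idx=PAD)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, nlayers)
        self.mlm_head = nn.Linear(d_model, len(VOCAB))   # masked-token prediction
        self.task_head = nn.Linear(d_model, 1)           # downstream property head

    def forward(self, tokens):
        return self.encoder(self.emb(tokens))            # per-token representations

def mask_tokens(tokens, rate=0.15):
    """Randomly replace tokens with <mask>; return corrupted input and targets."""
    corrupted = tokens.clone()
    labels = torch.full_like(tokens, -100)               # -100 is ignored by the loss
    chosen = (torch.rand_like(tokens, dtype=torch.float) < rate) & (tokens != PAD)
    labels[chosen] = tokens[chosen]
    corrupted[chosen] = MASK
    return corrupted, labels

# Self-supervised pretraining step on unlabeled SMILES (toy batch).
model = SmilesEncoder()
optim = torch.optim.Adam(model.parameters(), lr=1e-4)
unlabeled = ["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O"]
batch = torch.tensor([encode(s) for s in unlabeled])
corrupted, labels = mask_tokens(batch)
logits = model.mlm_head(model(corrupted))
loss = nn.functional.cross_entropy(
    logits.reshape(-1, len(VOCAB)), labels.reshape(-1), ignore_index=-100)
loss.backward()
optim.step()

# Fine-tuning step: pooled representation -> task-specific property prediction.
x, y = torch.tensor([encode("CCO")]), torch.tensor([[0.7]])   # hypothetical label
pooled = model(x).mean(dim=1)                                 # mean-pool over tokens
task_loss = nn.functional.mse_loss(model.task_head(pooled), y)
```

In such a scheme, the pretrained encoder supplies the "intrinsic chemical logic" described above, and only the task head (plus, optionally, the encoder weights) is updated during fine-tuning on the small labeled data set.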