Label-Free Data Mining of Scientific Literature by Unsupervised Syntactic Distance Analysis

Baicheng Zhang; Hengyu Xiao; Guilin Ye; Zhaokun Song; Tiantian Han; Edward Sharman; Man Luo; Aoyuan Cheng; Qing Zhu; Haitao Zhao; Guoqing Zhang; Song Wang; Jun Jiang

doi:10.1021/acs.jpclett.3c03345

J Phys Chem Lett. 2024 Jan 11;15(1):212-219. doi: 10.1021/acs.jpclett.3c03345. Epub 2023 Dec 29.

Affiliations

¹ Key Laboratory of Precision and Intelligent Chemistry, School of Chemistry and Materials Science, University of Science and Technology of China, Hefei, Anhui 230026, China.
² Hefei JiShu Quantum Technology Co. Ltd., Hefei 230026, China.
³ Department of Neurology, University of California, Irvine, California 92697, United States.
⁴ Materials Interfaces Center, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China.

PMID: 38157213

Abstract

Label-free data mining can efficiently feed large amounts of data from the vast scientific literature into artificial intelligence (AI) processing systems. Here, we demonstrate an unsupervised syntactic distance analysis (SDA) approach that mines chemical substances, functions, properties, and operations without annotation. Evaluated in several areas of research from the physical sciences, the SDA approach achieved information-mining performance comparable to that of supervised learning, with scores of 0.62-0.72 in precision, 0.60-0.82 in recall, and 0.86-0.95 in accuracy. We also showcase how our approach can assist robotic chemists programmed to perform research on double-perovskite colloidal nanocrystals, gold colloidal nanocrystals, oxygen evolution reaction catalysts, and enzyme-like catalysts by designing materials, formulations, and synthesis parameters based on data mined from 1.1 million literature references.
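
For readers less familiar with the three metrics quoted in the abstract, the following minimal Python sketch shows how precision, recall, and accuracy are computed from confusion-matrix counts. It is not the authors' evaluation code: the `Counts` class, the function names, and the example counts are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Confusion-matrix counts for a binary extraction task (hypothetical helper)."""
    tp: int  # items correctly extracted
    fp: int  # items extracted but wrong
    fn: int  # correct items that were missed
    tn: int  # non-items correctly left out


def precision(c: Counts) -> float:
    # Fraction of extracted items that are correct.
    denom = c.tp + c.fp
    return c.tp / denom if denom else 0.0


def recall(c: Counts) -> float:
    # Fraction of correct items that were extracted.
    denom = c.tp + c.fn
    return c.tp / denom if denom else 0.0


def accuracy(c: Counts) -> float:
    # Fraction of all decisions (positive and negative) that are correct.
    total = c.tp + c.fp + c.fn + c.tn
    return (c.tp + c.tn) / total if total else 0.0


# Invented counts, NOT taken from the paper.
example = Counts(tp=70, fp=30, fn=30, tn=870)

print(f"precision = {precision(example):.2f}")
print(f"recall    = {recall(example):.2f}")
print(f"accuracy  = {accuracy(example):.2f}")
```

With these invented counts, precision and recall are both 0.70 while accuracy is 0.94, because the many true negatives count toward accuracy but not toward the other two metrics. This is one way accuracy can sit well above precision and recall; whether it explains the ranges reported in the abstract depends on the label distribution of the evaluation sets, which the abstract does not give.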