Single-step retrosynthesis prediction by leveraging commonly preserved substructures

Nat Commun. 2023 Apr 28;14(1):2446. doi: 10.1038/s41467-023-37969-w.

Abstract

Retrosynthesis analysis is an important task in organic chemistry with numerous industrial applications. Previously, machine learning approaches employing natural language processing techniques achieved promising results in this task by first representing molecules as strings and then predicting reactant molecules using text generation or machine translation models. However, chemists cannot readily derive useful insights from these traditional approaches, which rely largely on atom-level decoding of the string representations, because human experts tend to interpret reactions by analyzing the substructures that make up a molecule. It is well established that some substructures are stable and remain unchanged in reactions. In this paper, we develop a substructure-level decoding model in which commonly preserved portions of product molecules are automatically extracted with a fully data-driven approach. Our model improves on previously reported models, and we demonstrate that its performance can be boosted further by enhancing the accuracy of these substructures. Analyzing the substructures extracted by our machine learning model can provide human experts with additional insights to assist decision-making in retrosynthesis analysis.