Machine learning for predicting halogen radical reactivity toward aqueous organic chemicals

Youheng Liang; Xiaoliu Huangfu; Ruixing Huang; Zhenpeng Han; Sisi Wu; Jingrui Wang; Xinlong Long; Jun Ma; Qiang He

doi:10.1016/j.jhazmat.2024.134501

J Hazard Mater. 2024 May 6:472:134501. doi: 10.1016/j.jhazmat.2024.134501. Online ahead of print.

Authors

Youheng Liang¹, Xiaoliu Huangfu², Ruixing Huang¹, Zhenpeng Han¹, Sisi Wu¹, Jingrui Wang¹, Xinlong Long¹, Jun Ma³, Qiang He¹

Affiliations

¹ Key Laboratory of Eco-Environments in Three Gorges Reservoir Region, Ministry of Education, College of Environment and Ecology, Chongqing University, Chongqing 400044, China.
² Key Laboratory of Eco-Environments in Three Gorges Reservoir Region, Ministry of Education, College of Environment and Ecology, Chongqing University, Chongqing 400044, China. Electronic address: hfxl-hit@163.com.
³ State Key Laboratory of Urban Water Resources and Environment, School of Municipal and Environmental Engineering, Harbin Institute of Technology, Harbin 150090, China.

PMID: 38735182
DOI: 10.1016/j.jhazmat.2024.134501

Abstract

Rapid advances in machine learning (ML) provide fast, accurate, and widely applicable methods for predicting the reactivity of organic pollutants toward free radicals. In this study, the rate constants (logk) of four halogen radicals were predicted using Morgan fingerprints (MF) and Mordred descriptors (MD) in combination with a series of ML models. The findings showed that accurate prediction across the various datasets depended on an effective pairing of descriptors and algorithms. To alleviate the challenge of limited sample size, we introduced a data combination strategy that merged the different datasets, which improved prediction accuracy and mitigated overfitting. The Light Gradient Boosting Machine (LightGBM) model with MF and the Random Forest (RF) model with MD, both based on the unified dataset, were selected as the optimal models. SHapley Additive exPlanations (SHAP) analysis showed that the MF-LightGBM model captured the influence of electron-withdrawing and electron-donating groups, while autocorrelation, walk count, and information content descriptors were the key features in the MD-RF model. The important contribution of pH was also highlighted. Applicability domain analysis further supported that the developed model can make reliable predictions for query compounds across a broader range. Finally, a practical web application for logk calculations was built.

Keywords: Halogen radical rate constants; Machine learning; Mordred descriptor; Morgan fingerprint; Web application.
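
The data combination strategy mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not describe how the datasets were merged, so everything here is an assumption: the placeholder radical labels (R1 to R4), the use of a one-hot radical indicator, the record layout, the helper name unify, and all numbers are hypothetical and are not taken from the study. The sketch only shows the general idea of pooling per-radical tables into one training table that keeps pH and the radical identity as features.

```python
from typing import Dict, List, Tuple

# Placeholder labels for the four halogen radicals (hypothetical names).
RADICALS = ["R1", "R2", "R3", "R4"]

# One record: (structural feature vector, pH, logk). The layout is assumed.
Record = Tuple[List[float], float, float]


def unify(datasets: Dict[str, List[Record]]) -> Tuple[List[List[float]], List[float]]:
    """Pool per-radical datasets into one feature matrix X and target list y.

    Each row is: structural features + [pH] + one-hot radical indicator.
    The one-hot encoding is an illustrative choice, not the authors' method.
    """
    X: List[List[float]] = []
    y: List[float] = []
    for radical in RADICALS:
        one_hot = [1.0 if radical == r else 0.0 for r in RADICALS]
        for features, ph, logk in datasets.get(radical, []):
            X.append(list(features) + [ph] + one_hot)
            y.append(logk)
    return X, y


# Toy, made-up values: three-bit "fingerprints", a pH, and a logk per compound.
toy = {
    "R1": [([1.0, 0.0, 1.0], 7.0, 9.1), ([0.0, 1.0, 1.0], 3.0, 8.4)],
    "R2": [([1.0, 1.0, 0.0], 7.0, 6.2)],
    "R3": [([0.0, 0.0, 1.0], 5.0, 9.8)],
    "R4": [],
}
X, y = unify(toy)
```

A single regressor trained on X and y then sees every radical's data at once, which is one plausible way a small per-radical dataset could borrow information from the others.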