POPs identification using simple low-code machine learning

Lei Xin; Haiying Yu; Sisi Liu; Guang-Guo Ying; Chang-Er Chen

doi:10.1016/j.scitotenv.2024.171143

Sci Total Environ. 2024 Apr 15:921:171143. doi: 10.1016/j.scitotenv.2024.171143. Epub 2024 Feb 20.

Authors

Lei Xin¹, Haiying Yu², Sisi Liu¹, Guang-Guo Ying¹, Chang-Er Chen³

Affiliations

¹ School of Environment, MOE Key Laboratory of Theoretical Chemistry of Environment, South China Normal University, Guangzhou 510006, China; Environmental Research Institute, Guangdong Provincial Key Laboratory of Chemical Pollution and Environmental Safety, South China Normal University, Guangzhou 510006, China.
² College of Geography and Environmental Sciences, Zhejiang Normal University, Jinhua 321004, China.
³ School of Environment, MOE Key Laboratory of Theoretical Chemistry of Environment, South China Normal University, Guangzhou 510006, China; Environmental Research Institute, Guangdong Provincial Key Laboratory of Chemical Pollution and Environmental Safety, South China Normal University, Guangzhou 510006, China. Electronic address: changer.chen@m.scnu.edu.cn.

PMID: 38387592

Abstract

Identifying persistent organic pollutants (POPs) within extensive organic chemical datasets is a formidable but important challenge. Machine learning can enhance this process, but previous models often demanded advanced programming skills and high-end computing resources. In this study, we used PyCaret, a Python-based package, to construct machine-learning models for POP screening based on 2D molecular descriptors, and compared their performance against a deep convolutional neural network (DCNN) model. With minimal Python code, we generated several models that performed comparably to or better than the DCNN. The best performer, the Light Gradient Boosting Machine (LGBM), achieved an accuracy of 96.20%, an AUC of 97.70%, and an F1 score of 82.58%, outperforming the DCNN model. It also performed well in identifying POPs within the REACH PBT and compiled industrial chemical lists. Our findings highlight the accessibility and simplicity of PyCaret, which requires only a few lines of code and is therefore suitable for non-computing professionals in environmental sciences. The ability of low-code machine learning tools (e.g. PyCaret) to facilitate model comparison and interpretation holds promise for the prompt assessment and management of chemical substances.
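The abstract reports accuracy, AUC, and F1 side by side, and the gap between a high accuracy and a lower F1 score is typical of screening tasks where the target class is rare. The following minimal sketch, in plain Python, shows how these three metrics are commonly computed. It is not the authors' code and does not use PyCaret; the function names and all counts, labels, and scores are hypothetical values chosen for illustration, not data from the study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def roc_auc(labels, scores):
    """AUC as the probability that a randomly chosen positive is scored
    above a randomly chosen negative (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical screening outcome: 1000 chemicals, of which 100 are true POPs.
metrics = classification_metrics(tp=80, fp=20, tn=880, fn=20)

# Hypothetical labels (1 = POP) and predicted POP probabilities.
example_labels = [1, 1, 0, 0, 0]
example_scores = [0.9, 0.4, 0.5, 0.2, 0.1]
auc = roc_auc(example_labels, example_scores)

if __name__ == "__main__":
    for name, value in metrics.items():
        print(f"{name}: {value:.2%}")
    print(f"auc: {auc:.2%}")
```

In this hypothetical case the many correctly rejected non-POPs inflate accuracy, while F1 depends only on how the comparatively rare POP class is handled, which is why the two metrics can diverge.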

Keywords: Chemical management; Classification; Machine learning; Persistent organic pollutants (POPs); PyCaret; Risk assessment.