POPs identification using simple low-code machine learning

Sci Total Environ. 2024 Apr 15:921:171143. doi: 10.1016/j.scitotenv.2024.171143. Epub 2024 Feb 20.

Abstract

Identifying persistent organic pollutants (POPs) within extensive organic chemical datasets is a formidable challenge but is of utmost importance. Machine learning techniques can enhance this process, but previous models often demanded advanced programming skills and high-end computing resources. In this study, we used PyCaret, a Python-based package, to construct machine-learning models for POP screening based on 2D molecular descriptors, and compared their performance against a deep convolutional neural network (DCNN) model. Utilising minimal Python code, we generated several models whose performance was superior or comparable to that of the DCNN. The best performer, the Light Gradient Boosting Machine (LGBM), achieved an accuracy of 96.20%, an AUC of 97.70%, and an F1 score of 82.58%, outperforming the DCNN model. It also excelled in identifying POPs within the REACH PBT and compiled industrial chemical lists. Our findings highlight the accessibility and simplicity of PyCaret, which requires only a few lines of code and is therefore suitable for environmental science professionals without a computing background. The ability of low-code machine learning tools (e.g. PyCaret) to facilitate model comparison and interpretation holds promise for the prompt assessment and management of chemical substances.
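
The abstract reports an accuracy well above the F1 score, a pattern that is typical when the positive class (here, POPs) is a small fraction of the screened chemicals. The following is a minimal sketch of how these two metrics are computed from a confusion matrix. The counts are hypothetical and chosen only for illustration; they are not taken from the study, and the names `ConfusionCounts` and `classification_metrics` are assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    """Confusion-matrix counts for a binary POP / non-POP classifier."""
    tp: int  # POPs correctly flagged as POPs
    fp: int  # non-POPs wrongly flagged as POPs
    fn: int  # POPs missed by the model
    tn: int  # non-POPs correctly left unflagged


def classification_metrics(c: ConfusionCounts) -> dict:
    """Return accuracy, precision, recall and F1 for the positive class."""
    total = c.tp + c.fp + c.fn + c.tn
    accuracy = (c.tp + c.tn) / total
    # Guard against division by zero when a class is never predicted or absent.
    precision = c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0
    recall = c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical, imbalanced test set: 100 POPs among 1000 chemicals.
counts = ConfusionCounts(tp=80, fp=15, fn=20, tn=885)
metrics = classification_metrics(counts)

for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

Because the many true negatives count towards accuracy but not towards F1, a model can score highly on accuracy while its F1 score for the minority class remains noticeably lower, which is why both are worth reporting for screening tasks of this kind.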

Keywords: Chemical management; Classification; Machine learning; Persistent organic pollutants (POPs); PyCaret; Risk assessment.