Classification and Regression Machine Learning Models for Predicting Aerobic Ready and Inherent Biodegradation of Organic Chemicals in Water

Environ Sci Technol. 2022 Sep 6;56(17):12755-12764. doi: 10.1021/acs.est.2c01764. Epub 2022 Aug 16.

Abstract

Machine learning (ML) is viewed as a promising tool for predicting aerobic biodegradation, one of the most important elimination pathways of organic chemicals from the environment. However, available models are built on small datasets (<3200 records), make only binary classification predictions, evaluate only ready biodegradability, and do not incorporate experimental conditions (e.g., system setup and reaction time). This study addressed all these limitations by first compiling a large database of 12,750 records, covering both ready and inherent biodegradation under different conditions, and then developing regression and classification models using different chemical representations and ML algorithms. The best regression model (R² = 0.54 and root mean square error of 0.25) and the best classification model (prediction accuracy of 85.1%) performed well. Model interpretation indicated that the models correctly captured the effects of chemical substructures, in the order C═O > O═C-O > OH > CH3 > halogen > branching > N > 6-member ring. Accounting for chemical speciation based on pKa and α notations did not affect the regression model performance but significantly improved the classification model performance (the accuracy increased to 87.6%). The models also showed large applicability domains and provided reasonable predictions for more than 98% of over 850,000 environmentally relevant chemicals in the Distributed Structure-Searchable Toxicity database. These robust, trustworthy models were finally made widely accessible through two free online predictors with graphical user interfaces.
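
The performance figures quoted above (R², root mean square error, and prediction accuracy) follow standard definitions, which can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not taken from the study: the sample values are invented, biodegradation is taken to be expressed as a fraction between 0 and 1, and the 0.6 cutoff used to derive class labels is hypothetical.

```python
import math

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted values."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def accuracy(labels_true, labels_pred):
    """Fraction of records whose predicted class matches the observed class."""
    correct = sum(1 for t, p in zip(labels_true, labels_pred) if t == p)
    return correct / len(labels_true)

# Invented example data (not from the study): biodegradation as a fraction.
observed = [0.10, 0.40, 0.75, 0.90]
predicted = [0.20, 0.35, 0.55, 0.95]

# Hypothetical cutoff for turning a fraction into a binary class label.
CUTOFF = 0.6
observed_labels = [value >= CUTOFF for value in observed]
predicted_labels = [value >= CUTOFF for value in predicted]

r2_value = r_squared(observed, predicted)
rmse_value = rmse(observed, predicted)
accuracy_value = accuracy(observed_labels, predicted_labels)

print(f"R2 = {r2_value:.3f}, RMSE = {rmse_value:.3f}, accuracy = {accuracy_value:.2f}")
```

R² and the root mean square error are computed on the continuous values, whereas accuracy is computed only after each value has been reduced to a class label, so the two kinds of model are not scored on the same scale.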

Keywords: CO2 evolution test; DOC die away; EU method C.4; OECD 301; closed bottle test; closed respirometer; inherent biodegradation; ready biodegradation.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Biodegradation, Environmental
  • Machine Learning
  • Organic Chemicals / chemistry
  • Water Pollutants, Chemical* / chemistry
  • Water* / chemistry

Substances

  • Organic Chemicals
  • Water Pollutants, Chemical
  • Water