Large-scale assessment of PFAS compounds in drinking water sources using machine learning

Water Res. 2023 Sep 1:243:120307. doi: 10.1016/j.watres.2023.120307. Epub 2023 Jul 4.

Abstract

The monitoring of per- and polyfluoroalkyl substances (PFAS) in drinking water sources has increased significantly since they were recognized as a major public health concern. The resulting data have been used to assess the importance of potential explanatory variables in determining the presence and concentration of PFAS in different regions. Nevertheless, the significance of these variables, and the reliability of the methods in regions beyond those where they were initially tested, remain uncertain. Hence, our research pursues two main objectives: 1) to evaluate the validity of these variables and methods for several PFAS species in a different area, and 2) to build on existing modeling work by introducing a new PFAS predictive model that is more reliable in determining the presence and concentration of PFAS at a regional level. To achieve these goals, we reconstructed four state-of-the-art models using a statewide dataset available for Michigan. These models involve spatial regression techniques, classification and regression random forest algorithms, and boosted regression trees. They also include numerous explanatory variables, such as features of local soil and hydrology and the number of nearby contamination sources. We then used a Bayesian selection approach to identify the most relevant of these variables. Finally, we employed the most relevant covariates to assess PFAS occurrence and estimate PFAS concentrations using a novel combination of machine learning algorithms and conditional autoregressive (CAR) modeling. For occurrence, PFAS presence was assessed with an accuracy comparable to that of the reconstructed models (>90%) while using significantly fewer variables. For concentration, while maintaining low data requirements, the estimated concentrations of most PFAS compounds were more closely aligned with available observations than those of previous methods, with correlation coefficients ρ > 0.90 and R² > 0.77.
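The two agreement measures reported above can be illustrated with a minimal sketch. It assumes that ρ denotes the Pearson correlation coefficient and that R² is the usual coefficient of determination; the abstract does not state either definition. The function names and the numbers are hypothetical and are not taken from the Michigan dataset.

```python
from math import sqrt


def pearson_r(observed, estimated):
    """Pearson correlation between observed and estimated values (assumed meaning of rho)."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_e = sum(estimated) / n
    cov = sum((o - mean_o) * (e - mean_e) for o, e in zip(observed, estimated))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_e = sum((e - mean_e) ** 2 for e in estimated)
    return cov / sqrt(var_o * var_e)


def r_squared(observed, estimated):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed meaning of R^2)."""
    mean_o = sum(observed) / len(observed)
    ss_res = sum((o - e) ** 2 for o, e in zip(observed, estimated))
    ss_tot = sum((o - mean_o) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Made-up concentrations for illustration only (not study data).
observed = [2.0, 4.0, 6.0, 8.0]
estimated = [2.5, 3.5, 6.5, 7.5]

print("rho =", round(pearson_r(observed, estimated), 3))
print("R2  =", round(r_squared(observed, estimated), 3))
```

The two measures answer different questions: a high ρ only shows that the estimates rise and fall with the observations, whereas R² also penalizes systematic bias in the estimates, which may be why both are reported.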

Keywords: Bayesian; Boosted Regression Trees; Conditional autoregressive; Michigan; PFAS compounds; Random Forest.

MeSH terms

  • Bayes Theorem
  • Drinking Water*
  • Fluorocarbons*
  • Machine Learning
  • Reproducibility of Results

Substances

  • Drinking Water
  • Fluorocarbons