Machine learning approach identifies water sample source based on microbial abundance

Water Res. 2021 Jul 1:199:117185. doi: 10.1016/j.watres.2021.117185. Epub 2021 Apr 27.

Abstract

Water quality can change along a river system due to differences in adjacent land use patterns and discharge sources. These variations can induce rapid responses in the aquatic microbial community, which may therefore serve as an indicator of water quality characteristics. In the current study, we used a random forest model to predict the source of water samples from three river ecosystems along a gradient of anthropogenic disturbance (i.e., a less disturbed mountainous area, an urban area receiving wastewater discharge, and an agricultural area with pesticide and fertilizer application) based on environmental physicochemical indices (PCIs), microbiological indices (MBIs), and their combination. Among the PCI-based models, using conventional water quality indices as inputs predicted water sample source markedly better than using pharmaceutical and personal care products (PPCPs), and much better than using polycyclic aromatic hydrocarbons (PAHs) and substituted PAHs (SPAHs). Among the MBI-based models, using the abundances of the top 30 bacteria combined with pathogenic antibiotic resistant bacteria (PARB) as inputs achieved the lowest median out-of-bag error rate (9.9%) and a higher median kappa coefficient (0.8694), whereas adding fungal inputs reduced the kappa coefficient. The model based on the top 30 bacteria still showed an advantage over models based on PCIs or on the combination of PCIs and MBIs. With future improvements in sequencing technology and increases in data availability, the proposed method provides an economical, rapid, and reliable way to identify water sample sources from abundance data of microbial communities.
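The abstract compares models by two metrics, the out-of-bag (OOB) error rate and the kappa coefficient. The sketch below shows how both can be computed from a list of true and predicted sample sources. It is a minimal illustration, not the authors' code: the function names `error_rate` and `cohen_kappa` are hypothetical, kappa is assumed to be Cohen's kappa, and the labels are invented toy values rather than data from the study. In a random forest, the predictions would be OOB predictions, that is, each sample is predicted only by trees whose bootstrap sample did not include it.

```python
from collections import Counter


def error_rate(y_true, y_pred):
    """Fraction of samples whose predicted source differs from the true source."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    wrong = sum(t != p for t, p in zip(y_true, y_pred))
    return wrong / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: agreement between labels, corrected for chance agreement."""
    n = len(y_true)
    # Observed agreement is one minus the error rate.
    p_observed = 1.0 - error_rate(y_true, y_pred)
    # Chance agreement: for each class, the product of its share among the
    # true labels and its share among the predicted labels, summed over classes.
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    p_chance = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_chance == 1.0:
        raise ValueError("kappa is undefined when only one class is present")
    return (p_observed - p_chance) / (1.0 - p_chance)


# Toy labels (invented for illustration): four samples per river ecosystem.
y_true = ["mountain"] * 4 + ["urban"] * 4 + ["agricultural"] * 4
y_pred = [
    "mountain", "mountain", "mountain", "urban",                  # one mountain sample misassigned
    "urban", "urban", "urban", "urban",                           # all urban samples correct
    "agricultural", "agricultural", "agricultural", "mountain",   # one agricultural sample misassigned
]

toy_error = error_rate(y_true, y_pred)   # 2 of 12 samples are misclassified
toy_kappa = cohen_kappa(y_true, y_pred)  # approximately 0.75 for these toy labels
```

With three equally sized classes, chance agreement is about one third, so kappa (about 0.75 here) is lower than raw accuracy (about 0.83). This is why the two metrics are reported together: a low error rate alone does not show how much of the agreement exceeds what chance would produce.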

Keywords: Machine learning classification; Microbial abundance; Physicochemical indices; Random forest; Source identification of water samples.

MeSH terms

  • Ecosystem
  • Environmental Monitoring
  • Machine Learning
  • Percutaneous Coronary Intervention*
  • Polycyclic Aromatic Hydrocarbons* / analysis
  • Rivers
  • Wastewater / analysis
  • Water
  • Water Pollutants, Chemical* / analysis

Substances

  • Polycyclic Aromatic Hydrocarbons
  • Waste Water
  • Water Pollutants, Chemical
  • Water