Identification the source of fecal contamination for geographically unassociated samples with a statistical classification model based on support vector machine

J Hazard Mater. 2021 Apr 5;407:124821. doi: 10.1016/j.jhazmat.2020.124821. Epub 2020 Dec 11.

Abstract

High-throughput sequencing reveals bacterial diversity and its biological significance, providing a large amount of information for source tracking of fecal contamination. This study examined how well classification models predict the fecal source of geographically local and non-local samples using the support vector machine (SVM) algorithm, with random forest (RF) and AdaBoost applied for comparison. Discriminatory sequences were selected from the Clostridiales, Bacteroidales, or Lactobacillales bacterial groups using extremely randomized trees (ExtraTrees). The representative markers comprised 1.51-12.64% of the unique sequences in the original library and contributed 70% of the discrepancies between source microbiomes. On local samples, the overall accuracy of the SVM model and the RF model was 96.08% and 98.04%, respectively, both higher than that of AdaBoost (90.20%). For non-local samples, the SVM assigned most fecal samples to the correct category, although several false-positive judgments occurred in closely related groups. These results suggest that SVM is a time-saving and accurate method for fecal source tracking in contaminated water bodies, with the potential to handle geographically unassociated samples, and underline the necessity of qPCR analysis for accurate detection of human-source pollution.
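
The workflow described in the abstract (ExtraTrees-based selection of discriminatory sequences, followed by SVM, RF, and AdaBoost classifiers compared on overall accuracy) could be sketched roughly as follows with scikit-learn. This is a minimal illustration, not the authors' implementation: the abstract does not report the software, kernel, hyperparameters, or selection threshold, so the linear kernel, the "mean" importance threshold, the estimator counts, and the synthetic abundance table below are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    ExtraTreesClassifier,
    RandomForestClassifier,
)
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for a samples-by-sequences abundance table.
# ASSUMPTION: this is NOT the paper's data; three classes play the role of fecal sources.
X, y = make_classification(
    n_samples=300, n_features=200, n_informative=15, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, class_sep=2.0, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)


def make_selector():
    """ExtraTrees-based feature selection (threshold choice is an assumption)."""
    forest = ExtraTreesClassifier(n_estimators=200, random_state=0)
    # Keep features whose importance is at least the mean importance.
    return SelectFromModel(forest, threshold="mean")


# ASSUMPTION: hyperparameters are illustrative defaults, not values from the paper.
classifiers = {
    "SVM": SVC(kernel="linear", C=1.0),
    "RF": RandomForestClassifier(n_estimators=200, random_state=0),
    "AdaBoost": AdaBoostClassifier(n_estimators=100, random_state=0),
}

accuracy = {}
n_selected = {}
for name, clf in classifiers.items():
    # Selection is fitted inside the pipeline, so it only ever sees training data.
    model = make_pipeline(make_selector(), StandardScaler(), clf)
    model.fit(X_train, y_train)
    accuracy[name] = accuracy_score(y_test, model.predict(X_test))
    n_selected[name] = int(model.named_steps["selectfrommodel"].get_support().sum())

for name in classifiers:
    print(f"{name}: accuracy={accuracy[name]:.4f}, features kept={n_selected[name]}")
```

Fitting the selector inside each pipeline keeps the held-out samples out of the feature-selection step, which matters when the goal is to estimate accuracy on samples the model has not seen, such as the non-local samples discussed above.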

Keywords: 16S rRNA; Amplicon sequencing; Fecal source tracking; Machine learning; SVM.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Bacteroidetes
  • Feces
  • Humans
  • RNA, Ribosomal, 16S
  • Support Vector Machine*
  • Water Microbiology
  • Water Pollution* / analysis

Substances

  • RNA, Ribosomal, 16S