Novel application of a statistical technique, Random Forests, in a bacterial source tracking study

Water Res. 2010 Jul;44(14):4067-76. doi: 10.1016/j.watres.2010.05.019. Epub 2010 May 31.

Abstract

In this study, data from bacterial source tracking (BST) analysis using antibiotic resistance profiles were examined using two statistical techniques, Random Forests (RF) and discriminant analysis (DA) to determine sources of fecal contamination of a Texas water body. Cow Trap and Cedar Lakes are potential oyster harvesting waters located in Brazoria County, Texas, that have been listed as impaired for bacteria on the 2004 Texas 303(d) list. Unknown source Escherichia coli were isolated from water samples collected in the study area during two sampling events. Isolates were confirmed as E. coli using carbon source utilization profiles and then analyzed via ARA, following the Kirby-Bauer disk diffusion method. Zone diameters from ARA profiles were analyzed with both DA and RF. Using a two-way classification (human vs nonhuman), both DA and RF categorized over 90% of the 299 unknown source isolates as a nonhuman source. The average rates of correct classification (ARCCs) for the library of 1172 isolates using DA and RF were 74.6% and 82.3%, respectively. ARCCs from RF ranged from 7.7 to 12.0% higher than those from DA. Rates of correct classification (RCCs) for individual sources classified with RF ranged from 23.2 to 0.2% higher than those of DA, with a mean difference of 9.0%. Additional evidence for the outperformance of DA by RF was found in the comparison of training and test set ARCCs and examination of specific disputed isolates; RF produced higher ARCCs (ranging from 8 to 13% higher) than DA for all 1000 trials (excluding the two-way classification, in which RF outperformed DA 999 out of 1000 times). This is of practical significance for analysis of bacterial source tracking data. Overall, based on both DA and RF results, migratory birds were found to be the source of the largest portion of the unknown E. coli isolates. This study is the first known published application of Random Forests in the field of BST.

Publication types

  • Research Support, Non-U.S. Gov't
  • Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

  • Animals
  • Bacteria / isolation & purification*
  • Discriminant Analysis
  • Drug Resistance, Microbial*
  • Escherichia coli / isolation & purification
  • Feces / microbiology
  • Fresh Water / microbiology
  • Humans
  • Models, Statistical*
  • Ostreidae / microbiology
  • Texas
  • Water Microbiology*
  • Water Pollutants / analysis*

Substances

  • Water Pollutants