Machine Learning-Based Determination of Sampling Depth for Complex Environmental Systems: Case Study with Single-Cell Raman Spectroscopy Data in EBPR Systems

Environ Sci Technol. 2022 Sep 20;56(18):13473-13484. doi: 10.1021/acs.est.1c08768. Epub 2022 Sep 1.

Abstract

Rapid progress in advanced analytical methods, such as single-cell technologies, enables an unprecedented and deeper understanding of microbial ecology beyond the resolution of conventional approaches. A major challenge in applying these methods is determining a sufficient sample size without adequate prior knowledge of community complexity, while balancing statistical power against limited time and resources. This hinders the desired standardization and wider application of these technologies. Here, we proposed, tested, and validated a computational sample size assessment protocol based on a metric named kernel divergence. This metric has two advantages. First, it directly compares distributional differences between whole data sets, with no need for human intervention or preclassification based on prior knowledge. Second, data processing makes minimal assumptions about the distribution and the sample space, which broadens its application domain. Tests verified that this allows appropriate handling of data sets with both linear and nonlinear relationships. The model was then validated in a case study with single-cell Raman spectroscopy (SCRS) phenotyping data sets from eight different enhanced biological phosphorus removal (EBPR) activated sludge communities located across North America. The model allows the determination of a sufficient sample size for any targeted or customized information capture capacity or resolution level. Given its flexibility and minimal restrictions on input data types, the proposed method is expected to serve as a standardized approach for sample size optimization, enabling more comparable and reproducible experiments and analyses on complex environmental samples. Finally, these advantages allow the approach to be extended to other single-cell technologies or environmental applications with data sets exhibiting continuous features.
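The general idea of assessing sample size through a kernel-based distributional comparison can be illustrated with a minimal sketch. The abstract does not define kernel divergence, so the code below substitutes a different, well-known quantity as a stand-in: the (biased) squared maximum mean discrepancy with a Gaussian kernel. It is not the authors' metric or protocol. The function names, the bandwidth parameter gamma, the threshold, and the synthetic two-cluster data are all hypothetical choices made for illustration only.

```python
import math
import random


def rbf(u, v, gamma):
    """Gaussian (RBF) kernel between two feature vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def mean_kernel(xs, ys, gamma):
    """Average kernel value over all pairs (x, y)."""
    return sum(rbf(x, y, gamma) for x in xs for y in ys) / (len(xs) * len(ys))


def mmd2(xs, ys, gamma):
    """Biased estimate of the squared maximum mean discrepancy.

    Stand-in for the paper's kernel divergence (an assumption, not the
    published definition).
    """
    return (mean_kernel(xs, xs, gamma)
            + mean_kernel(ys, ys, gamma)
            - 2.0 * mean_kernel(xs, ys, gamma))


def divergence_curve(data, sizes, gamma, n_repeats=5, seed=0):
    """Mean divergence between random subsamples and the full data set."""
    rng = random.Random(seed)
    curve = []
    for n in sizes:
        # Draw n_repeats subsamples without replacement and average.
        vals = [mmd2(rng.sample(data, n), data, gamma) for _ in range(n_repeats)]
        curve.append(sum(vals) / n_repeats)
    return curve


def sufficient_size(sizes, curve, threshold):
    """Smallest tested size whose mean divergence is at or below threshold."""
    for n, d in zip(sizes, curve):
        if d <= threshold:
            return n
    return None


# Synthetic stand-in for per-cell feature vectors: two clusters of 60 points.
_rng = random.Random(1)
data = [[_rng.gauss(c, 1.0) for _ in range(3)] for c in (0.0, 3.0) for _ in range(60)]
sizes = [5, 20, 60, 120]
curve = divergence_curve(data, sizes, gamma=0.5)
for n, d in zip(sizes, curve):
    print(n, d)
```

In this sketch, the divergence shrinks as the subsample approaches the full data set, and a user-chosen threshold turns the curve into a sample size recommendation. The threshold plays the role of the "targeted or customized" resolution level mentioned in the abstract, although how the published protocol sets it is not stated here.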

Keywords: EBPR; machine learning; sample size assessment; single-cell Raman microspectroscopy; single-cell technology.

Publication types

  • Research Support, Non-U.S. Gov't
  • Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

  • Biological Products*
  • Humans
  • Machine Learning
  • Phosphorus* / chemistry
  • Polyphosphates
  • Sewage
  • Spectrum Analysis, Raman

Substances

  • Biological Products
  • Polyphosphates
  • Sewage
  • Phosphorus