Achieving unbiased predictions of national-scale groundwater redox conditions via data oversampling and statistical learning

Sci Total Environ. 2020 Feb 25:705:135877. doi: 10.1016/j.scitotenv.2019.135877. Epub 2019 Dec 3.

Abstract

An important policy consideration for integrated land and water management is to understand the spatial distribution of nitrate attenuation in the groundwater system, for which redox condition is the key indicator. This paper proposes a methodology to accommodate the computational demands of large datasets, and presents national-scale predictions of groundwater redox class for New Zealand. Our approach applies statistical learning methods to relate the redox class determined on groundwater samples to spatially varying attributes. The trained model uses these spatial variables to predict redox status in areas without sample data. We assembled the groundwater sample data from regional authority databases, and assigned each sample a redox class. A key achievement was to overcome the influence of sample selection bias on model training via oversampling. We removed additional bias imposed by imbalances in the predictor variables by applying a conditional inference random forest classifier. The unbiased trained model uses eight predictors, and achieves a high validation performance (accuracy 0.81, kappa 0.71), providing good confidence in model predictions. National maps are provided for redox class and probability at specified depths. Feature importance rankings indicate that reducing conditions are associated with poorly-drained soils, and to a lesser extent, high hydrological variability, low elevation, and low-permeability lithology. These conditions are common in New Zealand's coastal and lowland plains, where artificial drainage is required to make land suitable for production. The spatial extent of reduced groundwater increases with depth, suggesting a shallow influence of soil infiltration or mobile organic carbon, and a deeper influence of lithological electron donors. Our model provides unbiased predictions at a scale relevant for environmental policy development and legislation. Identifying where the ecosystem service provided by denitrification can be utilised will enable spatially targeted interventions that can achieve the desired environmental outcome in a more cost-effective manner than non-targeted interventions.
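
As an illustration of two ideas mentioned in the abstract, the sketch below shows one common form of oversampling (randomly duplicating records from under-represented classes until all classes are the same size) and how accuracy and Cohen's kappa are computed from observed and predicted classes. The abstract does not specify the oversampling scheme used in the study, so this is an assumption and not the authors' procedure. The function names, class labels, and sample records are hypothetical, and the classifier itself is not reproduced here.

```python
import random
from collections import Counter, defaultdict


def oversample(records, label_of, seed=0):
    """Randomly duplicate records so every class matches the largest class.

    Assumption: simple random oversampling with replacement; the study's
    actual scheme is not described in the abstract.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for rec in records:
        by_class[label_of(rec)].append(rec)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Draw the shortfall with replacement from the same class.
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced


def accuracy_and_kappa(observed, predicted):
    """Return (accuracy, Cohen's kappa) for two equal-length label lists."""
    n = len(observed)
    p_o = sum(o == p for o, p in zip(observed, predicted)) / n
    obs_counts = Counter(observed)
    pred_counts = Counter(predicted)
    # Agreement expected by chance from the marginal class frequencies.
    p_e = sum(obs_counts[c] * pred_counts[c] for c in obs_counts) / (n * n)
    if p_e == 1.0:
        return p_o, 1.0  # degenerate case: only one class present
    return p_o, (p_o - p_e) / (1.0 - p_e)


# Hypothetical samples: (redox class, sampling depth in metres).
samples = (
    [("oxic", d) for d in (5, 8, 12, 15, 20, 30)]
    + [("mixed", d) for d in (10, 25)]
    + [("reduced", d) for d in (40, 60, 75)]
)
balanced = oversample(samples, label_of=lambda rec: rec[0])
class_sizes = Counter(rec[0] for rec in balanced)

observed = ["oxic", "oxic", "reduced", "reduced"]
predicted = ["oxic", "reduced", "reduced", "reduced"]
acc, kappa = accuracy_and_kappa(observed, predicted)
```

Kappa discounts the agreement expected by chance, which is why the reported kappa (0.71) is lower than the reported accuracy (0.81). Any oversampling of this kind would normally be applied to the training data only, so that validation metrics are not inflated by duplicated records.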

Keywords: Data oversampling; Groundwater; Nitrate; Random forest classifier; Redox status; Statistical learning.