Input variable selection and calibration data selection for storm water quality regression models

Water Sci Technol. 2013;68(1):50-8. doi: 10.2166/wst.2013.222.

Abstract

Storm water quality models are useful tools in storm water management. Interest has been growing in analyzing existing data for developing models for urban storm water quality evaluations. It is important to select appropriate model inputs when many candidate explanatory variables are available. Model calibration and verification are essential steps in any storm water quality modeling. This study investigates input variable selection and calibration data selection in storm water quality regression models. The two selection problems are mutually interacted. A procedure is developed in order to fulfil the two selection tasks in order. The procedure firstly selects model input variables using a cross validation method. An appropriate number of variables are identified as model inputs to ensure that a model is neither overfitted nor underfitted. Based on the model input selection results, calibration data selection is studied. Uncertainty of model performances due to calibration data selection is investigated with a random selection method. An approach using the cluster method is applied in order to enhance model calibration practice based on the principle of selecting representative data for calibration. The comparison between results from the cluster selection method and random selection shows that the former can significantly improve performances of calibrated models. It is found that the information content in calibration data is important in addition to the size of calibration data.

MeSH terms

  • Environmental Monitoring / methods*
  • Models, Theoretical*
  • Rain
  • Water Quality