Relative performance of different data mining techniques for nitrate concentration and load estimation in different type of watersheds

Shiyang Li; Rabin Bhattarai; Richard A Cooke; Siddhartha Verma; Xiangfeng Huang; Momcilo Markus; Laura Christianson

doi:10.1016/j.envpol.2020.114618

Environ Pollut. 2020 Aug;263(Pt A):114618. doi: 10.1016/j.envpol.2020.114618. Epub 2020 Apr 17.

Authors

Shiyang Li¹, Rabin Bhattarai², Richard A Cooke³, Siddhartha Verma³, Xiangfeng Huang¹, Momcilo Markus⁴, Laura Christianson⁵

Affiliations

¹ College of Environmental Science and Engineering, State Key Laboratory of Pollution Control and Resource Reuse, Ministry of Education Key Laboratory of Yangtze River Water Environment, Tongji University, Shanghai, 200092, People's Republic of China.
² Department of Agricultural and Biological Engineering, University of Illinois at Urbana Champaign, 1304 W Pennsylvania Ave #338, Urbana, IL, 61801, USA. Electronic address: rbhatta2@illinois.edu.
³ Department of Agricultural and Biological Engineering, University of Illinois at Urbana Champaign, 1304 W Pennsylvania Ave #338, Urbana, IL, 61801, USA.
⁴ Prairie Research Institute, Illinois State Water Survey, 2204 Griffith Dr., Champaign, IL, 61820, USA.
⁵ Department of Crop Sciences, University of Illinois at Urbana Champaign, AW-101 Turner Hall, 1102 South Goodwin Avenue, Urbana, IL, 61801, USA.

PMID: 33618470
DOI: 10.1016/j.envpol.2020.114618

Abstract

The increasing availability of water quality datasets has led to a greater focus on hydrologic and water quality analysis, which in turn requires more efficient and accurate modelling methods. Data mining techniques are increasingly used in place of more traditional simulation methods for water quality analysis and for predicting the concentration and load of nitrogen pollutants. In this study, we tested the multilayer perceptron (MLP), k-nearest neighbor (k-NN), random forest (RF), and reduced error pruning tree (REPTree) methods, along with traditional linear regression, to predict nitrate levels based on long-term data from six watersheds with different land-use practices in the midwestern United States. Both the concentration and load results indicated that REPTree had the best performance, with an R² of 0.61-0.85 and a relative absolute error of <75.8%. Watershed type, however, influenced the performance of the data mining methods: all four methods showed higher accuracy for urban-dominant watersheds and lower accuracy for agricultural and forested watersheds. Of these four methods, the classification tree methods (REPTree and RF) performed better than the cluster methods (MLP and k-NN) for agricultural and forested watersheds. Our results indicate that both the data structure based on the dominant land use and the type of algorithm should be carefully considered when selecting a data mining method to predict nitrate concentration and load for a watershed.
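The abstract reports model skill as R² and relative absolute error but does not define either metric. The sketch below shows one common way such scores are computed, assuming relative absolute error is the total absolute error divided by that of a predictor that always returns the observed mean (expressed in percent), and assuming R² is the squared Pearson correlation between observed and predicted values. The function names and the sample nitrate values are hypothetical and are not taken from the study.

```python
from statistics import mean


def relative_absolute_error(observed, predicted):
    """Relative absolute error in percent (assumed definition).

    Total absolute error of the model divided by the total absolute
    error of a baseline that always predicts the observed mean.
    """
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    baseline = mean(observed)
    denominator = sum(abs(o - baseline) for o in observed)
    if denominator == 0:
        raise ValueError("observed values are constant; RAE is undefined")
    numerator = sum(abs(o - p) for o, p in zip(observed, predicted))
    return 100.0 * numerator / denominator


def r_squared(observed, predicted):
    """Squared Pearson correlation between observed and predicted (assumed definition)."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_obs, mean_pred = mean(observed), mean(predicted)
    covariance = sum((o - mean_obs) * (p - mean_pred)
                     for o, p in zip(observed, predicted))
    var_obs = sum((o - mean_obs) ** 2 for o in observed)
    var_pred = sum((p - mean_pred) ** 2 for p in predicted)
    if var_obs == 0 or var_pred == 0:
        raise ValueError("constant input; correlation is undefined")
    return covariance ** 2 / (var_obs * var_pred)


# Hypothetical nitrate concentrations in mg/L, for illustration only.
observed_mg_l = [2.0, 4.0, 6.0, 8.0]
predicted_mg_l = [3.0, 4.0, 5.0, 8.0]

rae_percent = relative_absolute_error(observed_mg_l, predicted_mg_l)
r2 = r_squared(observed_mg_l, predicted_mg_l)
print(f"RAE = {rae_percent:.1f}%, R2 = {r2:.3f}")
```

Under this definition, a relative absolute error below 100% means the model beats the mean-only baseline, which is one way to read the reported bound of <75.8%.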

Keywords: Data mining; Nitrate concentration; Water pollution; Watershed land use.

MeSH terms

Agriculture*
Data Mining
Environmental Monitoring
Midwestern United States
Nitrates* / analysis
Water Quality

Substances

Nitrates