Development of a basin-scale total nitrogen prediction model by integrating clustering and regression methods

Sci Total Environ. 2024 Apr 10:920:170765. doi: 10.1016/j.scitotenv.2024.170765. Epub 2024 Feb 9.

Abstract

Nutrient runoff into rivers caused by human activity has led to global eutrophication issues. The Nakdong River in South Korea currently faces significant challenges related to eutrophication and harmful algal blooms, underscoring the critical importance of managing total nitrogen (T-N) levels. However, traditional laboratory analysis, which depends on manual sampling, is labor-intensive and limited in its ability to collect high-frequency data. Despite advances in sensor technology that allow various parameters to be measured, sensors still cannot measure T-N directly, necessitating surrogate regression methods. Therefore, we predicted T-N using a water quality dataset collected from 2018 to 2022 at 157 observatories within the Nakdong River basin. To account for the water quality characteristics of each location, we employed a clustering technique to divide the basin and compared a Gaussian mixture model with K-means clustering. The optimal regressor for each cluster was then selected by comparing multiple linear regression (MLR), random forest, and XGBoost. The results showed that forming four clusters via K-means clustering was the most suitable approach and that MLR was reasonably accurate for all clusters. Subsequently, recursive feature elimination with cross-validation was used to identify suitable parameters for T-N prediction, leading to the construction of high-accuracy T-N prediction models. Clustering was useful not only for improving the regressors but also for spatially analyzing the water quality characteristics of the Nakdong River. The MLR model can reveal causal relationships and is thus useful for decision-making. The results of this study revealed that the combination of a simple MLR model and a clustering method can be applied to a large watershed. The clustering-based regression model showed potential for accurately predicting T-N at the basin level and is expected to contribute to nationwide water quality management through future applications in various fields.
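
The workflow summarized above (K-means clustering into four groups, then a per-cluster linear regressor with recursive feature elimination and cross-validation) can be illustrated with a minimal sketch. This is not the authors' code or data: it assumes scikit-learn and NumPy, uses synthetic features in place of real sensor parameters, clusters individual samples rather than observatories, and every name, coefficient, and setting in it is hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for basin data (assumption): four water-quality regimes,
# five sensor-like features, and a regime-specific linear relation to T-N
# in which only the first two features are informative.
n_regimes, n_per_regime, n_features = 4, 200, 5
X_parts, y_parts = [], []
for k in range(n_regimes):
    center = np.zeros(n_features)
    center[k] = 8.0  # shift each regime along its own axis so regimes separate
    X_k = center + rng.normal(size=(n_per_regime, n_features))
    coef = np.array([1.0 + k, -0.5 * (k + 1), 0.0, 0.0, 0.0])
    y_k = X_k @ coef + rng.normal(scale=0.1, size=n_per_regime)
    X_parts.append(X_k)
    y_parts.append(y_k)
X = np.vstack(X_parts)
y = np.concatenate(y_parts)

# Step 1: standardize the features and split the samples into four clusters.
X_scaled = StandardScaler().fit_transform(X)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_scaled)

# Step 2: for each cluster, fit a multiple linear regression and let RFECV
# choose the feature subset by 5-fold cross-validated R^2.
models = {}
scores = {}
for c in np.unique(labels):
    mask = labels == c
    selector = RFECV(LinearRegression(), step=1, cv=5, scoring="r2")
    selector.fit(X_scaled[mask], y[mask])
    models[int(c)] = selector
    # In-sample R^2 of the final model on the selected features.
    scores[int(c)] = selector.score(X_scaled[mask], y[mask])
    kept = np.flatnonzero(selector.support_).tolist()
    print(f"cluster {c}: n={mask.sum()}, kept features={kept}, "
          f"R^2={scores[int(c)]:.3f}")
```

To predict for a new sample under this sketch, one would apply the same scaler, assign the sample to a cluster with `kmeans.predict`, and call `predict` on that cluster's fitted selector. Swapping `KMeans` for `sklearn.mixture.GaussianMixture`, or `LinearRegression` for a tree-based regressor, would give the kind of comparison the abstract describes.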

Keywords: Basin-scale; Clustering; Multiple linear regression; Nakdong River; Total nitrogen.