Water quality prediction based on sparse dataset using enhanced machine learning

Sheng Huang; Jun Xia; Yueling Wang; Jiarui Lei; Gangsheng Wang

doi:10.1016/j.ese.2024.100402

Environ Sci Ecotechnol. 2024 Mar 1:20:100402. doi: 10.1016/j.ese.2024.100402. eCollection 2024 Jul.

Authors

Sheng Huang¹ ² ³, Jun Xia¹ ² ⁴, Yueling Wang⁴, Jiarui Lei³, Gangsheng Wang¹ ²

Affiliations

¹ State Key Laboratory of Water Resources Engineering and Management, Wuhan University, Wuhan 430072, China.
² Institute for Water-Carbon Cycles and Carbon Neutrality, Wuhan University, Wuhan 430072, China.
³ Department of Civil and Environmental Engineering, National University of Singapore, 117578 Singapore.
⁴ Key Laboratory of Water Cycle and Related Land Surface Processes, Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences, Beijing 100101, China.

Abstract

Water quality in surface bodies remains a pressing issue worldwide. While some regions have rich water quality data, less attention is given to areas that lack sufficient data. Therefore, it is crucial to explore novel ways of managing source-oriented surface water pollution in scenarios where data are collected infrequently, such as weekly or monthly. Here we show that water pollution can be predicted from sparse datasets using machine learning. We investigated the efficacy of a traditional Recurrent Neural Network alongside three Long Short-Term Memory (LSTM) models, integrated with the Load Estimator (LOADEST). The research was conducted at a river-lake confluence, an area with intricate hydrological patterns. We found that the Self-Attentive LSTM (SA-LSTM) model outperformed the other three machine learning models in predicting water quality, achieving Nash-Sutcliffe Efficiency (NSE) scores of 0.71 for COD_Mn and 0.57 for NH₃N when utilizing LOADEST-augmented water quality data (referred to as the SA-LSTM-LOADEST model). The SA-LSTM-LOADEST model improved upon the standalone SA-LSTM model by reducing the Root Mean Square Error (RMSE) by 24.6% for COD_Mn and 21.3% for NH₃N. Furthermore, the model maintained its predictive accuracy when data collection intervals were extended from weekly to monthly. Additionally, the SA-LSTM-LOADEST model demonstrated the capability to forecast pollution loads up to ten days in advance. This study shows promise for improving water quality modeling in regions with limited monitoring capabilities.
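
For readers unfamiliar with the two metrics quoted in the abstract, the following is a minimal sketch of how NSE and RMSE are conventionally computed from paired observed and simulated values. It uses the standard textbook definitions and is not taken from the authors' code; the function names `nse` and `rmse` and the sample concentrations are hypothetical and chosen only for illustration.

```python
import math


def nse(observed, simulated):
    """Nash-Sutcliffe Efficiency: 1 - SSE / variance of observations about their mean.

    1.0 is a perfect fit; 0.0 means the model is no better than
    always predicting the observed mean.
    """
    mean_obs = sum(observed) / len(observed)
    sse = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    spread = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - sse / spread


def rmse(observed, simulated):
    """Root Mean Square Error, in the same units as the observations."""
    sse = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    return math.sqrt(sse / len(observed))


# Hypothetical concentrations (mg/L), for illustration only.
obs = [2.0, 3.0, 4.0, 5.0]
sim = [2.5, 2.5, 4.5, 4.5]

print(f"NSE  = {nse(obs, sim):.3f}")
print(f"RMSE = {rmse(obs, sim):.3f}")
```

Under these definitions, a percentage reduction in RMSE such as those reported above would be computed as (RMSE_baseline - RMSE_new) / RMSE_baseline × 100.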

Keywords: Load estimator; Long short-term memory; Machine learning; River-lake confluence; Sparse measurement; Water quality modeling.