Improving prediction of water quality indices using novel hybrid machine-learning algorithms

Duie Tien Bui; Khabat Khosravi; John Tiefenbacher; Hoang Nguyen; Nerantzis Kazakis

doi:10.1016/j.scitotenv.2020.137612

Sci Total Environ. 2020 Jun 15:721:137612. doi: 10.1016/j.scitotenv.2020.137612. Epub 2020 Mar 3.

Authors

Duie Tien Bui¹, Khabat Khosravi², John Tiefenbacher³, Hoang Nguyen⁴, Nerantzis Kazakis⁵

Affiliations

¹ Geographic Information Science Research Group, Ton Duc Thang University, Ho Chi Minh City, Viet Nam; Faculty of Environment and Labour Safety, Ton Duc Thang University, Ho Chi Minh City, Viet Nam. Electronic address: buitiendieu@tdtu.edu.vn.
² School of Engineering, University of Guelph, Guelph, Canada. Electronic address: kkhosrav@uoguelph.ca.
³ Department of Geography, Texas State University, San Marcos, TX 78666, USA. Electronic address: tief@txstate.edu.
⁴ Institute of Research and Development, Duy Tan University, Da Nang 550000, Viet Nam. Electronic address: nguyenhoang23@duytan.edu.vn.
⁵ Aristotle University of Thessaloniki, Department of Geology, Lab. of Engineering Geology & Hydrogeology, 54124 Thessaloniki, Greece. Electronic address: kazakis@geo.auth.gr.

PMID: 32169637

Abstract

River water quality assessment is one of the most important tasks in improving water resources management plans. A water quality index (WQI) considers several water quality variables simultaneously. Traditional WQI calculations are time-consuming and are often fraught with errors in the derivation of sub-indices. In this study, 4 standalone algorithms (random forest (RF), M5P, random tree (RT), and reduced error pruning tree (REPT)) and 12 hybrid data-mining algorithms (combinations of the standalone algorithms with bagging (BA), CV parameter selection (CVPS), and randomizable filtered classification (RFC)) were used to predict the Iran WQI (IRWQI_sc). Six years (2012 to 2018) of monthly data from two water quality monitoring stations within the Talar catchment were compiled. Using Pearson correlation coefficients, 10 different input combinations were constructed. The data were divided into two groups (ratio 70:30) for model building (training dataset) and model validation (testing dataset) using a 10-fold cross-validation technique. The models were evaluated using several statistical and visual evaluation metrics. The results show that fecal coliform (FC) and total solids (TS) had the greatest and the least effect, respectively, on the prediction of IRWQI_sc. The best input combinations varied among the algorithms; generally, variables with very low correlations yielded weaker performance. Hybrid algorithms improved the prediction power of several of the standalone models, but not all. Hybrid BA-RT outperformed the other models (R² = 0.941, RMSE = 2.71, MAE = 1.87, NSE = 0.941, PBIAS = 0.500). PBIAS indicated that all algorithms, with the exceptions of RT, BA-RT, and CVPS-REPT, overestimated WQI values.

Keywords: Data mining; Novel hybrid algorithms; Prediction; Water quality index.
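
The abstract reports five evaluation metrics (R², RMSE, MAE, NSE, PBIAS) without giving their formulas. The following is a minimal sketch of how such metrics are commonly computed, not the authors' implementation. It assumes R² is the squared Pearson correlation between observed and predicted values, and that PBIAS follows the common hydrological convention 100 × Σ(observed − predicted) / Σ(observed), under which negative values indicate overestimation. The function name `evaluation_metrics` and the sample values are hypothetical.

```python
import math


def evaluation_metrics(observed, predicted):
    """Return R2, RMSE, MAE, NSE and PBIAS for paired observed/predicted values.

    Assumptions (not stated in the abstract):
    - R2 is the squared Pearson correlation coefficient.
    - PBIAS = 100 * sum(obs - pred) / sum(obs), so a negative value
      means the model overestimates on average.
    Constant observed or predicted series are not handled (division by zero).
    """
    if len(observed) != len(predicted) or not observed:
        raise ValueError("observed and predicted must be non-empty and equal in length")

    n = len(observed)
    mean_obs = sum(observed) / n
    mean_pred = sum(predicted) / n

    # Residuals: observed minus predicted.
    errors = [o - p for o, p in zip(observed, predicted)]
    ss_res = sum(e * e for e in errors)

    # Sums of squares around the means.
    ss_obs = sum((o - mean_obs) ** 2 for o in observed)
    ss_pred = sum((p - mean_pred) ** 2 for p in predicted)
    cov = sum((o - mean_obs) * (p - mean_pred) for o, p in zip(observed, predicted))

    return {
        "R2": cov ** 2 / (ss_obs * ss_pred),
        "RMSE": math.sqrt(ss_res / n),
        "MAE": sum(abs(e) for e in errors) / n,
        "NSE": 1.0 - ss_res / ss_obs,  # Nash-Sutcliffe efficiency
        "PBIAS": 100.0 * sum(errors) / sum(observed),
    }


if __name__ == "__main__":
    # Hypothetical WQI values, for illustration only.
    obs = [50.0, 60.0, 70.0, 80.0]
    pred = [52.0, 58.0, 73.0, 79.0]
    for name, value in evaluation_metrics(obs, pred).items():
        print(f"{name}: {value:.3f}")
```

Under this sign convention, the positive PBIAS reported for BA-RT (0.500) corresponds to slight underestimation, which is consistent with BA-RT being listed among the exceptions to overestimation.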