Two novelty learning models developed based on deep cascade forest to address the environmental imbalanced issues: A case study of drinking water quality prediction

Xingguo Chen; Houtao Liu; Fengrui Liu; Tian Huang; Ruqin Shen; Yongfeng Deng; Da Chen

doi:10.1016/j.envpol.2021.118153

Environ Pollut. 2021 Dec 15:291:118153. doi: 10.1016/j.envpol.2021.118153. Epub 2021 Sep 11.

Authors

Xingguo Chen¹, Houtao Liu², Fengrui Liu³, Tian Huang², Ruqin Shen⁴, Yongfeng Deng⁵, Da Chen⁴

Affiliations

¹ Jiangsu Key Laboratory of Big Data Security & Intelligent Processing, Nanjing University of Posts and Telecommunications, Nanjing, Jiangsu, 210023, China; State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, Jiangsu, 210023, China.
² Jiangsu Key Laboratory of Big Data Security & Intelligent Processing, Nanjing University of Posts and Telecommunications, Nanjing, Jiangsu, 210023, China.
³ John F. Kennedy School of Government, Harvard University, Cambridge, MA, 02138, USA.
⁴ School of Environment, Guangzhou Key Laboratory of Environmental Exposure and Health, and Guangdong Key Laboratory of Environmental Pollution and Health, Jinan University, Guangzhou, Guangdong, 510632, China.
⁵ School of Environment, Guangzhou Key Laboratory of Environmental Exposure and Health, and Guangdong Key Laboratory of Environmental Pollution and Health, Jinan University, Guangzhou, Guangdong, 510632, China. Electronic address: Yongfengdeng@jnu.edu.cn.

PMID: 34534828
DOI: 10.1016/j.envpol.2021.118153

Abstract

Environmental quality data sets are typically imbalanced because pollution events are rarely observed in daily life. Prediction on imbalanced data sets is a major challenge in machine learning. Our recent work has shown that deep cascade forest (DCF) is a promising base learning model for environmental quality prediction. Although some traditional models have been improved by introducing a cost matrix, little is known about whether a cost matrix could enhance the prediction performance of DCF. Feature extraction is another potential way to improve a model's ability to predict imbalanced data. Here, we developed two novel learning models based on DCF: cost-sensitive DCF (CS-DCF) and DCF combined with unsupervised learning models and greedy methods (USM-DCF-G). Both models were then validated on an imbalanced drinking water quality data set. Our results showed that both CS-DCF and USM-DCF-G achieved better prediction performance than DCF alone. In particular, USM-DCF-G, which applies feature extraction and selection using unsupervised learning models and greedy methods, showed the best performance, with the highest F1-score (95.12 ± 2.56%). Thus, the two learning models, especially USM-DCF-G, are promising for addressing environmental imbalance issues and accurately predicting environmental quality.
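
To illustrate the general idea behind a cost matrix on imbalanced data, the following is a minimal sketch in Python and not the authors' CS-DCF implementation. It assumes a binary problem (1 = pollution event, 0 = normal), takes class probabilities from some already trained classifier, and picks the label with the lowest expected cost. The cost values, the toy labels, and the toy probabilities are all hypothetical and chosen only for illustration; the names `COST`, `min_cost_label`, and `f1_score` are likewise illustrative.

```python
from typing import List, Sequence

# Hypothetical cost matrix, indexed as COST[true_label][predicted_label].
# Missing a pollution event (true 1, predicted 0) is assumed to cost
# five times as much as a false alarm (true 0, predicted 1).
COST = [[0.0, 1.0],
        [5.0, 0.0]]

# Symmetric costs reproduce the usual 0.5 probability threshold.
SYMMETRIC = [[0.0, 1.0],
             [1.0, 0.0]]


def min_cost_label(p_pos: float, cost: Sequence[Sequence[float]] = COST) -> int:
    """Return the label with the lowest expected cost, given P(label = 1)."""
    expected_cost_0 = (1 - p_pos) * cost[0][0] + p_pos * cost[1][0]
    expected_cost_1 = (1 - p_pos) * cost[0][1] + p_pos * cost[1][1]
    return 1 if expected_cost_1 < expected_cost_0 else 0


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1-score of the positive (minority) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy imbalanced sample: 8 normal records, 2 pollution events (made-up values).
y_true: List[int] = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
p_pos: List[float] = [0.05, 0.10, 0.10, 0.12, 0.15, 0.30, 0.10, 0.05, 0.40, 0.70]

plain_pred = [min_cost_label(p, SYMMETRIC) for p in p_pos]
cost_pred = [min_cost_label(p, COST) for p in p_pos]

print("F1 with symmetric costs:", f1_score(y_true, plain_pred))
print("F1 with cost matrix:    ", f1_score(y_true, cost_pred))
```

In this toy sample the asymmetric costs lower the effective decision threshold, so the second pollution event is no longer missed, at the price of one false alarm. How CS-DCF integrates the cost matrix into the cascade itself is not described in the abstract, so the sketch covers only the decision rule.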

Keywords: Cost-sensitive; Deep cascade forest; Environmental imbalance issues; Feature extraction; Feature selection.

MeSH terms

Drinking Water*
Forests
Machine Learning
Water Quality

Substances

Drinking Water