Imputation of missing monthly rainfall data using machine learning and spatial interpolation approaches in Thale Sap Songkhla River Basin, Thailand

Environ Sci Pollut Res Int. 2022 Sep 29. doi: 10.1007/s11356-022-23022-8. Online ahead of print.

Abstract

Missing rainfall data are a prevalent problem and a primary concern in hydrology and meteorology. This research examined the capability of machine learning (ML) and spatial interpolation (SI) methods to estimate missing monthly rainfall data. Six ML algorithms (i.e. multiple linear regression (MLR), M5 model tree (M5), random forest (RF), support vector regression (SVR), multilayer perceptron (MLP), and genetic programming (GP)) and four SI methods (i.e. arithmetic average (AA), inverse distance weighting (IDW), correlation coefficient weighted (CCW), and normal ratio (NR)) were investigated and their performance compared. Twelve rainfall stations, located in the Thale Sap Songkhla river basin and nearby basins, served as the case study. Hyper-parameters were tuned for each ML method to obtain the most suitable model for the data sets considered. Three performance metrics (i.e. NSE, OI, and r) were chosen, and the sum of these three metrics was introduced to compare the methods' performance. The experimental results indicated that the selection of neighbouring stations was essential when applying SI methods, but not when applying ML methods. Overall, ML methods imputed missing monthly rainfall better than SI methods because they overcame spatial constraints. GP performed best, with NSE = 0.825, OI = 0.877, and r = 0.909 in the training stage and 0.796, 0.852, and 0.902, respectively, in the testing stage. It was followed by SVR-rbf, SVR-poly, and RF. NR performed best among the four SI methods, followed by CCW, AA, and IDW. When applying SI methods, neighbouring stations whose correlation with the target station is greater than 0.80 should be considered.

Keywords: Hyper-parameters; Imputation; Machine learning; Missing rainfall data; Spatial interpolation.
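
To make the four SI estimators and the NSE criterion named in the abstract concrete, the following is a minimal sketch using their common textbook forms. The abstract does not give the formulas, so the exact weighting used in the study (for example the IDW exponent, or whether CCW weights by r or a power of r) is an assumption here, and all function and parameter names are hypothetical.

```python
import numpy as np


def arithmetic_average(neighbours):
    """AA: unweighted mean of the neighbouring stations' rainfall."""
    return float(np.mean(neighbours))


def inverse_distance_weighting(neighbours, distances, power=2.0):
    """IDW: weights are 1 / distance**power (power=2 is an assumed default)."""
    weights = 1.0 / np.asarray(distances, dtype=float) ** power
    return float(np.average(neighbours, weights=weights))


def correlation_coefficient_weighted(neighbours, correlations):
    """CCW: weights are each neighbour's correlation with the target station
    (assumed to be the plain coefficient r, not a power of it)."""
    weights = np.asarray(correlations, dtype=float)
    return float(np.average(neighbours, weights=weights))


def normal_ratio(neighbours, target_normal, neighbour_normals):
    """NR: each neighbour value is scaled by the ratio of the target station's
    long-term mean rainfall to that neighbour's long-term mean, then averaged."""
    ratios = target_normal / np.asarray(neighbour_normals, dtype=float)
    return float(np.mean(ratios * np.asarray(neighbours, dtype=float)))


def nse(observed, simulated):
    """Nash-Sutcliffe efficiency: 1 minus the ratio of the squared error to the
    variance of the observations about their mean (1.0 is a perfect fit)."""
    observed = np.asarray(observed, dtype=float)
    simulated = np.asarray(simulated, dtype=float)
    residual = np.sum((observed - simulated) ** 2)
    spread = np.sum((observed - observed.mean()) ** 2)
    return float(1.0 - residual / spread)
```

Each estimator takes the same-month rainfall at the neighbouring stations and returns one imputed value for the target station; `nse` can then score a series of imputed values against held-out observations. The ML methods, by contrast, would be fitted as regression models with the neighbouring stations as predictors, which is not sketched here.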