The importance of data splitting in combined NOx concentration modelling

Sci Total Environ. 2023 Apr 10:868:161744. doi: 10.1016/j.scitotenv.2023.161744. Epub 2023 Jan 21.

Abstract

Polluted air, breathed every day by people living in large conurbations, poses a significant risk to their health. Effective modelling (prediction) of pollutant concentrations, together with identification of the factors that influence them, should make it possible to obtain advance warning of hazards and to plan and implement measures to reduce them. This work describes two modelling approaches: one based on the NOx concentration of the previous hour (C&RT models), and one based on meteorological factors, traffic flow, and past (up to two previous hours) NOx and NO2 concentrations (CA models). For each approach, three alternative machine learning methods were applied: artificial neural network (ANN), random forest (RF), and support vector regression (SVR). The best fits were obtained for the models using ANN and RF, with mean absolute percentage error (MAPE) values in the range 18.3-18.5 %. Poorer fits were found for the SVR models (MAPE of 23.4 % for the C&RT approach and 29.3 % for CA). Across the various goodness-of-fit measures considered, neither the C&RT nor the CA approach was clearly preferable. The choice between them should be determined by the purpose for which the forecast is to be used.
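As an illustration of the goodness-of-fit measure quoted above, the following is a minimal sketch of how MAPE can be computed and how lagged inputs (the previous one or two hourly values) can be assembled from a concentration series. It is not the authors' code: the function names `mape` and `lagged_rows` and the sample values are hypothetical, and the "prediction" is a naive previous-hour carry-forward standing in for a fitted ANN, RF, or SVR model.

```python
def mape(observed, predicted):
    """Mean absolute percentage error, in percent.

    Assumes every observed value is non-zero; a zero observation
    raises ZeroDivisionError.
    """
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have equal length")
    if not observed:
        raise ValueError("at least one observation is required")
    terms = [abs((o - p) / o) for o, p in zip(observed, predicted)]
    return 100.0 * sum(terms) / len(terms)


def lagged_rows(series, n_lags):
    """Pair each value with the n_lags values immediately before it.

    Returns a list of (lag_values, target) tuples, oldest lag first.
    """
    rows = []
    for t in range(n_lags, len(series)):
        rows.append((series[t - n_lags:t], series[t]))
    return rows


# Hypothetical hourly NOx values, for illustration only (not study data).
nox = [40.0, 50.0, 80.0, 100.0]

# One lag: each target is paired with the previous hour's value.
rows = lagged_rows(nox, 1)
observed = [target for _, target in rows]
# Naive stand-in for a model: predict the previous hour's value unchanged.
predicted = [lags[-1] for lags, _ in rows]

print(f"MAPE of the naive previous-hour forecast: {mape(observed, predicted):.1f} %")
```

Calling `lagged_rows(nox, 2)` instead would give each target the two preceding hourly values as inputs, mirroring the "up to two previous hours" wording above; meteorological and traffic variables would be added as further columns.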

Keywords: Air pollution modelling; Artificial neural networks; Machine learning; NOx; Random forest; Splitting.