Is replacing missing values of PM2.5 constituents with estimates using machine learning better for source apportionment than exclusion or median replacement?

Environ Pollut. 2024 May 16:354:124165. doi: 10.1016/j.envpol.2024.124165. Online ahead of print.

Abstract

East Asian countries have been conducting source apportionment of fine particulate matter (PM2.5) by applying positive matrix factorization (PMF) to hourly constituent concentrations. However, some of the constituent data from the supersites in South Korea were missing due to instrument maintenance and calibration. Conventional preprocessing of missing values, such as exclusion or median replacement, biases the estimated source contributions by changing the PMF input. Machine learning (ML) can estimate the missing values by training on constituent data, meteorological data, and gaseous pollutant data. Complete data from the Seoul Supersite in 2018 were taken, and a random 20% was set as missing. Missing values were estimated using a random forest analysis, and PMF was performed after replacing the missing values with the estimates. Percent errors of the resulting source contributions were calculated relative to those estimated from the complete data. Estimation accuracy (r²) was as high as 0.874 when carbon species were missing and as low as 0.631 when ionic species and trace elements were missing. For the seven highest contributing sources, replacing the missing values of carbon species with estimates minimized the percent errors to 2.0% on average. However, replacing the missing values of the other chemical species with estimates increased the percent errors to more than 9.7% on average. Percent errors were largest, 37% on average, when missing values of ionic species and trace elements were replaced with estimates. Missing values of species other than carbon species therefore need to be excluded. This approach reduced the percent errors to 7.4% on average, which was lower than those due to median replacement. Our results show that biases in source apportionment can be reduced by replacing the missing values of carbon species with estimates. To reduce the biases due to missing values of the other chemical species, the estimation accuracy of the ML needs to be improved.
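
The evaluation design described above (take complete data, set a random 20% as missing, apply a preprocessing method, and compare against the complete-data result) can be illustrated with a minimal Python sketch. This is not the authors' code. The concentrations are synthetic, the function names are hypothetical, and the mean concentration is used only as a simple stand-in for a PMF source contribution, since PMF itself and the random forest estimator are not reproduced here. The percent error is assumed to be the absolute relative difference; the abstract does not state the exact formula.

```python
import random
import statistics


def mask_random(values, fraction=0.2, seed=0):
    """Return a copy of `values` with a random `fraction` of entries set to None."""
    rng = random.Random(seed)
    n_missing = round(len(values) * fraction)
    missing_idx = set(rng.sample(range(len(values)), n_missing))
    return [None if i in missing_idx else v for i, v in enumerate(values)]


def median_replace(values):
    """Replace missing entries (None) with the median of the observed entries."""
    observed = [v for v in values if v is not None]
    med = statistics.median(observed)
    return [med if v is None else v for v in values]


def exclude_missing(values):
    """Drop missing entries (None) entirely."""
    return [v for v in values if v is not None]


def percent_error(estimate, reference):
    """Absolute percent error relative to the reference (assumed definition)."""
    return 100.0 * abs(estimate - reference) / abs(reference)


# Synthetic hourly concentrations of one constituent (made-up data).
data_rng = random.Random(42)
complete = [data_rng.lognormvariate(1.0, 0.5) for _ in range(1000)]

# Set a random 20% of the complete data as missing.
masked = mask_random(complete, fraction=0.2, seed=1)

# Stand-in for a source contribution: the mean concentration.
reference = statistics.mean(complete)
err_median = percent_error(statistics.mean(median_replace(masked)), reference)
err_exclusion = percent_error(statistics.mean(exclude_missing(masked)), reference)

print(f"median replacement: {err_median:.2f}% error")
print(f"exclusion:          {err_exclusion:.2f}% error")
```

In an actual analysis, the stand-in statistic would be replaced by the source contributions from PMF runs on the complete and preprocessed inputs, and a third branch would replace the missing entries with ML estimates before computing the same percent error.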

Keywords: Machine learning; Missing value estimation; PM2.5 constituents; Positive matrix factorization (PMF); Random forest; Source apportionment.