Imputation methods for addressing missing data in short-term monitoring of air pollutants

Sci Total Environ. 2020 Aug 15:730:139140. doi: 10.1016/j.scitotenv.2020.139140. Epub 2020 May 3.

Abstract

Monitoring of environmental contaminants is a critical part of exposure sciences research and public health practice. Missing data are often encountered when performing short-term monitoring (<24 h) of air pollutants with real-time monitors, especially in resource-limited areas. Approaches for handling consecutive periods of missing and incomplete data in this context remain unclear. Our aim is to evaluate existing imputation methods for handling missing data for real-time monitors operating for short durations. In a current field-study, realtime PM2.5 monitors were placed outside of 20 households and ran for 24-hours. Missing data was simulated in these households at four consecutive periods of missingness (20%, 40%, 60%, 80%). Univariate (Mean, Median, Last Observation Carried Forward, Kalman Filter, Random, Markov) and multivariate time-series (Predictive Mean Matching, Row Mean Method) methods were used to impute missing concentrations, and performance was evaluated using five error metrics (Absolute Bias, Percent Absolute Error in Means, R2 Coefficient of Determination, Root Mean Square Error, Mean Absolute Error). Univariate methods of Markov, random, and mean imputations were the best performing methods that yielded 24-hour mean concentrations with the lowest error and highest R2 values across all levels of missingness. When evaluating error metrics minute-by-minute, Kalman filters, median, and Markov methods performed well at low levels of missingness (20-40%). However, at higher levels of missingness (60-80%), Markov, random, median, and mean imputation performed best on average. Multivariate methods were the worst performing imputation methods across all levels of missingness. Imputation using univariate methods may provide a reasonable solution to addressing missing data for short-term monitoring of air pollutants, especially in resource-limited areas. Further efforts are needed to evaluate imputation methods that are generalizable across a diverse range of study environments.

Keywords: Ambient PM2.5; Imputation; Missing data; Real-time monitoring.