Efficient data preprocessing, episode classification, and source apportionment of particle number concentrations

Chun-Sheng Liang; Hao Wu; Hai-Yan Li; Qiang Zhang; Zhanqing Li; Ke-Bin He

doi:10.1016/j.scitotenv.2020.140923

Sci Total Environ. 2020 Nov 20:744:140923. doi: 10.1016/j.scitotenv.2020.140923. Epub 2020 Jul 18.

Authors

Chun-Sheng Liang¹, Hao Wu², Hai-Yan Li³, Qiang Zhang⁴, Zhanqing Li⁵, Ke-Bin He⁶

Affiliations

¹ State Key Joint Laboratory of Environment Simulation and Pollution Control, School of Environment, Tsinghua University, Beijing 100084, China; State Environmental Protection Key Laboratory of Sources and Control of Air Pollution Complex, Beijing 100084, China.
² College of Global Change and Earth System Science, Beijing Normal University, Beijing 100875, China; China Global Atmosphere Watch Baseline Observatory (WMO/GAW Station), Xining 810001, China.
³ State Key Joint Laboratory of Environment Simulation and Pollution Control, School of Environment, Tsinghua University, Beijing 100084, China; Institute for Atmospheric and Earth System Research/Physics, Faculty of Science, University of Helsinki, Helsinki 00014, Finland.
⁴ Ministry of Education Key Laboratory for Earth System Modeling, Department of Earth System Science, Tsinghua University, Beijing 100084, China.
⁵ Department of Atmospheric and Oceanic Science, University of Maryland, College Park, MD 20742, USA. Electronic address: zli@atmos.umd.edu.
⁶ State Key Joint Laboratory of Environment Simulation and Pollution Control, School of Environment, Tsinghua University, Beijing 100084, China; State Environmental Protection Key Laboratory of Sources and Control of Air Pollution Complex, Beijing 100084, China. Electronic address: hekb@tsinghua.edu.cn.

PMID: 32755782
DOI: 10.1016/j.scitotenv.2020.140923

Abstract

Number concentration is an important index for measuring atmospheric particle pollution. However, tailored methods for preprocessing particle number concentration (PNC) data and for analyzing its characteristics and sources are rare, and interpreting the data is time-consuming and inefficient. In this method-oriented study, we develop and investigate techniques that use flexible conditions, C++-optimized algorithms, and parallel computing in R (open-source software for statistics and graphics) to tackle these challenges. The data preprocessing methods include deletion of variables and observations, outlier removal, and interpolation of missing values (NA). Compared with previous methods, they clean the data more effectively, retain more samples, and generate no new outliers after interpolation. In addition, automatic division of PNC pollution events based on relative values suits PNC properties and highlights the pollution characteristics related to sources and mechanisms. Furthermore, basic functions of k-means clustering, Principal Component Analysis (PCA), Factor Analysis (FA), Positive Matrix Factorization (PMF), and a newly introduced model, Non-negative Matrix Factorization (NMF), were tested and compared in analyzing PNC sources. Only PMF and NMF can identify coal heating and produce more explicable results, while NMF apportions sources more distinctly and runs 11-28 times faster than PMF. Traffic is interannually stable in non-heating periods and always dominant. Coal heating's contribution has decreased by 40%-86% over the five most recent heating periods, reflecting the effectiveness of coal burning control.
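
To make the preprocessing steps named in the abstract more concrete, the following is a minimal illustrative sketch in Python. It is not the authors' implementation, which was written in R with C++-optimized routines. The function names `mask_outliers` and `interpolate_linear`, the median/MAD outlier rule, the threshold `k`, and the sample values are all assumptions introduced here for illustration only. Outliers are first replaced by missing values, and interior gaps are then filled by linear interpolation.

```python
from statistics import median


def mask_outliers(values, k=5.0):
    """Replace values far from the median with None.

    Assumption: a simple median/MAD rule with threshold k; the paper's
    actual outlier criteria are not given in the abstract.
    """
    valid = [v for v in values if v is not None]
    med = median(valid)
    mad = median(abs(v - med) for v in valid)
    if mad == 0:
        # No spread in the data, so nothing can be flagged by this rule.
        return list(values)
    return [
        None if v is not None and abs(v - med) > k * mad else v
        for v in values
    ]


def interpolate_linear(values):
    """Fill interior None gaps linearly between the nearest valid neighbors.

    Leading and trailing gaps are left as None because they have only
    one valid neighbor.
    """
    out = list(values)
    known = [i for i, v in enumerate(out) if v is not None]
    for left, right in zip(known, known[1:]):
        gap = right - left
        for i in range(left + 1, right):
            w = (i - left) / gap
            out[i] = out[left] + w * (out[right] - out[left])
    return out


# Hypothetical PNC series (particles per cm^3) with gaps and one spike.
pnc = [8200.0, 8500.0, None, 9100.0, 250000.0, 9400.0, None, None, 10300.0]
cleaned = interpolate_linear(mask_outliers(pnc))
print(cleaned)
```

Because each filled value is a weighted average of its two valid neighbors, it always lies between them. In this sketch, that is why linear interpolation cannot introduce values more extreme than the data that survived outlier removal.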

Keywords: Data preprocessing; Episode classification; Number concentration; Particle pollution; Source apportionment.