A comparison of linear regression, regularization, and machine learning algorithms to develop Europe-wide spatial models of fine particles and nitrogen dioxide

Jie Chen; Kees de Hoogh; John Gulliver; Barbara Hoffmann; Ole Hertel; Matthias Ketzel; Mariska Bauwelinck; Aaron van Donkelaar; Ulla A Hvidtfeldt; Klea Katsouyanni; Nicole A H Janssen; Randall V Martin; Evangelia Samoli; Per E Schwarze; Massimo Stafoggia; Tom Bellander; Maciek Strak; Kathrin Wolf; Danielle Vienneau; Roel Vermeulen; Bert Brunekreef; Gerard Hoek

doi:10.1016/j.envint.2019.104934

Environ Int. 2019 Sep:130:104934. doi: 10.1016/j.envint.2019.104934. Epub 2019 Jun 20.

Authors

Jie Chen¹, Kees de Hoogh², John Gulliver³, Barbara Hoffmann⁴, Ole Hertel⁵, Matthias Ketzel⁶, Mariska Bauwelinck⁷, Aaron van Donkelaar⁸, Ulla A Hvidtfeldt⁹, Klea Katsouyanni¹⁰, Nicole A H Janssen¹¹, Randall V Martin¹², Evangelia Samoli¹³, Per E Schwarze¹⁴, Massimo Stafoggia¹⁵, Tom Bellander¹⁶, Maciek Strak¹⁷, Kathrin Wolf¹⁸, Danielle Vienneau¹⁹, Roel Vermeulen²⁰, Bert Brunekreef²¹, Gerard Hoek²²

Affiliations

¹ Institute for Risk Assessment Sciences (IRAS), Utrecht University, Postbus 80125, 3508 TC, Utrecht, the Netherlands. Electronic address: j.chen1@uu.nl.
² Swiss Tropical and Public Health Institute, Socinstrasse 57, 4051 Basel, Switzerland; University of Basel, Petersplatz 1, Postfach 4001 Basel, Switzerland. Electronic address: c.dehoogh@swisstph.ch.
³ Centre for Environmental Health and Sustainability, School of Geography, Geology and the Environment, University of Leicester, University Road, Leicester LE1 7RH, UK. Electronic address: jg435@leicester.ac.uk.
⁴ Institute for Occupational, Social and Environmental Medicine, Centre for Health and Society, Medical Faculty, Heinrich Heine University Düsseldorf, Universitätsstraße 1, 40225 Düsseldorf, Germany. Electronic address: B.Hoffmann@uni-duesseldorf.de.
⁵ Department of Environmental Science, Aarhus University, P.O. Box 358, Frederiksborgvej 399, 4000 Roskilde, Denmark. Electronic address: oh@envs.au.dk.
⁶ Department of Environmental Science, Aarhus University, P.O. Box 358, Frederiksborgvej 399, 4000 Roskilde, Denmark; Global Centre for Clean Air Research (GCARE), Department of Civil and Environmental Engineering, University of Surrey, Guildford GU2 7XH, UK. Electronic address: mke@envs.au.dk.
⁷ Interface Demography, Department of Sociology, Vrije Universiteit Brussel, Pleinlaan 2, 1050, Brussels, Belgium. Electronic address: mariska.bauwelinck@vub.ac.be.
⁸ Department of Physics and Atmospheric Science, Dalhousie University, B3H 4R2 Halifax, Nova Scotia, Canada. Electronic address: kelaar@Dal.Ca.
⁹ Danish Cancer Society Research Center, Strandboulevarden 49, 2100 Copenhagen, Denmark. Electronic address: ullah@cancer.dk.
¹⁰ Department of Hygiene, Epidemiology and Medical Statistics, Medical School, National and Kapodistrian University of Athens, 75 Mikras Asias Str, 115 27 Athens, Greece; Department Population Health Sciences and Department of Analytical, Environmental and Forensic Sciences, School of Population Health & Environmental Sciences, King's College Strand, London WC2R 2LS, UK. Electronic address: kkatsouy@med.uoa.gr.
¹¹ National Institute for Public Health and the Environment (RIVM), PO Box 1, 3720 BA, Bilthoven, the Netherlands. Electronic address: nicole.janssen@rivm.nl.
¹² Department of Physics and Atmospheric Science, Dalhousie University, B3H 4R2 Halifax, Nova Scotia, Canada; Atomic and Molecular Physics Division, Harvard-Smithsonian Center for Astrophysics, 60 Garden St, Cambridge, MA 02138, USA. Electronic address: Randall.Martin@Dal.Ca.
¹³ Department of Hygiene, Epidemiology and Medical Statistics, Medical School, National and Kapodistrian University of Athens, 75 Mikras Asias Str, 115 27 Athens, Greece. Electronic address: esamoli@med.uoa.gr.
¹⁴ Division of Environmental Medicine, Norwegian Institute of Public Health, PO Box 4404 Nydalen, N-0403 Oslo, Norway. Electronic address: Per.Schwarze@fhi.no.
¹⁵ Department of Epidemiology, Lazio Region Health Service/ASL Roma 1, Via Cristoforo Colombo, 112, 00147, Rome, Italy; Institute of Environmental Medicine, Karolinska Institutet, SE-171 77 Stockholm, Sweden. Electronic address: m.stafoggia@deplazio.it.
¹⁶ Institute of Environmental Medicine, Karolinska Institutet, SE-171 77 Stockholm, Sweden. Electronic address: Tom.Bellander@ki.se.
¹⁷ Institute for Risk Assessment Sciences (IRAS), Utrecht University, Postbus 80125, 3508 TC, Utrecht, the Netherlands. Electronic address: M.M.Strak@uu.nl.
¹⁸ Helmholtz Zentrum München, German Research Center for Environmental Health (GmbH), Institute of Epidemiology, Ingolstädter Landstr. 1, D-85764 Neuherberg, Germany. Electronic address: kathrin.wolf@helmholtz-muenchen.de.
¹⁹ Swiss Tropical and Public Health Institute, Socinstrasse 57, 4051 Basel, Switzerland; University of Basel, Petersplatz 1, Postfach 4001 Basel, Switzerland. Electronic address: danielle.vienneau@swisstph.ch.
²⁰ Institute for Risk Assessment Sciences (IRAS), Utrecht University, Postbus 80125, 3508 TC, Utrecht, the Netherlands; Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Heidelberglaan 100, 3584 CX, Utrecht, the Netherlands. Electronic address: R.C.H.Vermeulen@uu.nl.
²¹ Institute for Risk Assessment Sciences (IRAS), Utrecht University, Postbus 80125, 3508 TC, Utrecht, the Netherlands; Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Heidelberglaan 100, 3584 CX, Utrecht, the Netherlands. Electronic address: B.Brunekreef@uu.nl.
²² Institute for Risk Assessment Sciences (IRAS), Utrecht University, Postbus 80125, 3508 TC, Utrecht, the Netherlands. Electronic address: G.Hoek@uu.nl.

PMID: 31229871
DOI: 10.1016/j.envint.2019.104934

Abstract

Empirical spatial air pollution models have been applied extensively to assess exposure in epidemiological studies, with increasingly sophisticated and complex statistical algorithms beyond ordinary linear regression. However, different algorithms have rarely been compared in terms of their predictive ability. This study compared 16 algorithms to predict annual average fine particle (PM₂.₅) and nitrogen dioxide (NO₂) concentrations across Europe. The evaluated algorithms included linear stepwise regression, regularization techniques and machine learning methods. Air pollution models were developed based on the 2010 routine monitoring data from the AIRBASE dataset maintained by the European Environment Agency (543 sites for PM₂.₅ and 2399 sites for NO₂), using satellite observations, dispersion model estimates and land use variables as predictors. We compared the models by performing five-fold cross-validation (CV) and by external validation (EV) using annual average concentrations measured at 416 (PM₂.₅) and 1396 (NO₂) sites from the ESCAPE study. We further assessed the correlations between predictions by each pair of algorithms at the ESCAPE sites. For PM₂.₅, the models performed similarly across algorithms, with a mean CV R² of 0.59 and a mean EV R² of 0.53. Generalized boosted machine, random forest and bagging performed best (CV R² ≈ 0.63; EV R² 0.58-0.61), while backward stepwise linear regression, support vector regression and artificial neural network performed less well (CV R² 0.48-0.57; EV R² 0.39-0.46). Most of the PM₂.₅ model predictions at ESCAPE sites were highly correlated (R² > 0.85, with the exception of predictions from the artificial neural network). For NO₂, the models performed even more similarly across different algorithms, with CV R² values ranging from 0.57 to 0.62 and EV R² values ranging from 0.49 to 0.51. The predicted concentrations from all algorithms at ESCAPE sites were highly correlated (R² > 0.9). For both pollutants, biases were low for all models except the artificial neural network. In all algorithms that allow assessment of variable importance, dispersion model estimates and satellite observations were two of the most important predictors for PM₂.₅ models, whilst dispersion model estimates and traffic variables were most important for NO₂ models. Different statistical algorithms performed similarly when modelling spatial variation in annual average air pollution concentrations using a large number of training sites.
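
The two evaluation steps named in the abstract, five-fold cross-validation on the training sites and external validation at independent sites, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the authors' code: the data are synthetic, a single-predictor least-squares fit stands in for any of the 16 algorithms, R² is taken as 1 - SSE/SST, and the CV R² is computed on pooled out-of-fold predictions (the abstract does not state whether the study pooled predictions or averaged per-fold values). The function names are hypothetical.

```python
import random
from statistics import mean


def fit_simple_ols(xs, ys):
    """Least-squares slope and intercept for a single predictor."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def r_squared(observed, predicted):
    """R² as 1 - SSE/SST (an assumed definition, see lead-in)."""
    mo = mean(observed)
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    sst = sum((o - mo) ** 2 for o in observed)
    return 1.0 - sse / sst


def k_fold_cv_r2(xs, ys, k=5, seed=0):
    """Predict each site from a model fitted without its fold, then score."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal folds
    preds = [0.0] * len(xs)
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        slope, intercept = fit_simple_ols(
            [xs[i] for i in train], [ys[i] for i in train]
        )
        for i in held_out:
            preds[i] = intercept + slope * xs[i]
    return r_squared(ys, preds)


# Synthetic stand-ins: x plays the role of one predictor (for example a
# dispersion model estimate), y the measured annual mean concentration.
rng = random.Random(42)
train_x = [rng.uniform(5, 30) for _ in range(200)]
train_y = [2.0 + 0.8 * x + rng.gauss(0, 3) for x in train_x]
ext_x = [rng.uniform(5, 30) for _ in range(100)]
ext_y = [2.0 + 0.8 * x + rng.gauss(0, 3) for x in ext_x]

# Internal check: five-fold cross-validation on the training sites.
cv_r2 = k_fold_cv_r2(train_x, train_y, k=5)

# External check: fit on all training sites, score at independent sites.
slope, intercept = fit_simple_ols(train_x, train_y)
ev_r2 = r_squared(ext_y, [intercept + slope * x for x in ext_x])

print(f"CV R2 = {cv_r2:.2f}, EV R2 = {ev_r2:.2f}")
```

Comparing algorithms in this setup would amount to swapping the fitting function while keeping the fold assignment and the external sites fixed, so that differences in R² reflect the algorithm rather than the data split.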

Keywords: Fine particles; Land use regression; Machine learning; Nitrogen dioxide.

Publication types

Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

Air Pollutants / analysis*
Air Pollution / analysis
Environmental Monitoring / methods
Europe
Linear Models*
Machine Learning*
Nitrogen Dioxide / analysis*
Particulate Matter / analysis*

Substances

Air Pollutants
Particulate Matter
Nitrogen Dioxide

Grants and funding

MR/S019669/1/MRC_/Medical Research Council/United Kingdom