Machine Learning Models for Predicting the Occurrence of Respiratory Diseases Using Climatic and Air-Pollution Factors

Clin Exp Otorhinolaryngol. 2022 May;15(2):168-176. doi: 10.21053/ceo.2021.01536. Epub 2022 Jan 7.

Abstract

Objectives: Because climatic and air-pollution factors are known to influence the occurrence of respiratory diseases, we used these factors to develop machine learning models for predicting the occurrence of respiratory diseases.

Methods: We obtained the daily number of respiratory disease patients in Seoul. We used climatic and air-pollution factors to predict the daily number of patients treated for respiratory diseases per 10,000 inhabitants. We applied a relief-based feature selection algorithm to evaluate the importance of each feature. We developed two prediction models, one using gradient boosting and the other using Gaussian process regression (GPR). We employed the holdout cross-validation method, in which 75% of the data was used to train each model and the remaining 25% was used to test the trained model. We obtained the estimated number of respiratory disease patients by applying the developed prediction models to the test set. To evaluate the performance of each model, we calculated the coefficient of determination (R2) and the root mean square error (RMSE) between the original and estimated numbers of respiratory disease patients. We used the Shapley Additive exPlanations (SHAP) approach to interpret the estimated output of each machine learning model.
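
The evaluation procedure described in the Methods (a 75%/25% holdout split, then R2 and RMSE on the held-out portion) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the data are synthetic, the random shuffle before splitting is not stated in the abstract, and a one-variable least-squares line stands in for the gradient boosting and GPR models, which are not reproduced here. The helper names (`holdout_split`, `fit_line`, and so on) are hypothetical.

```python
import math
import random


def holdout_split(rows, train_fraction=0.75, seed=0):
    """Shuffle a copy of rows, then split it into train and test parts.

    Shuffling is an assumption; the abstract only gives the 75/25 ratio.
    """
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def r2_score(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def fit_line(pairs):
    """Least-squares fit of y = intercept + slope * x (stand-in model only)."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in pairs) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return mean_y - slope * mean_x, slope


# Synthetic (temperature, daily patients) pairs; not data from the study.
rng = random.Random(42)
rows = []
for _ in range(400):
    temperature = rng.uniform(-10.0, 30.0)
    patients = 120.0 - 2.0 * temperature + rng.gauss(0.0, 10.0)
    rows.append((temperature, patients))

train_rows, test_rows = holdout_split(rows)
intercept, slope = fit_line(train_rows)  # fit on the 75% training part only

observed = [y for _, y in test_rows]
predicted = [intercept + slope * x for x, _ in test_rows]
print(f"R2 = {r2_score(observed, predicted):.2f}, "
      f"RMSE = {rmse(observed, predicted):.1f}")
```

The point of the sketch is the separation of roles: the model sees only the training rows, and both metrics are computed only on rows it has never seen, which is what "unseen test data" means in the Results.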

Results: Features with negative weights in the relief-based algorithm were excluded. When gradient boosting was applied to unseen test data, the R2 and RMSE were 0.68 and 13.8, respectively. For GPR, the R2 and RMSE were 0.67 and 13.9, respectively. SHAP analysis showed that reductions in average temperature, daylight duration, average humidity, sulfur dioxide (SO2), total solar insolation amount, and temperature difference increased the number of respiratory disease patients, whereas increases in atmospheric pressure, carbon monoxide (CO), and particulate matter ≤2.5 μm in aerodynamic diameter (PM2.5) increased the number of respiratory disease patients.
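
To show how a SHAP-style attribution assigns a signed contribution to each feature, the following minimal Python sketch computes exact Shapley values for a toy model by enumerating all feature coalitions. It is a simplification under stated assumptions: features outside a coalition are replaced by a single baseline value, whereas SHAP implementations typically average over a background dataset, and the toy model, its coefficients, and the input values are hypothetical and not taken from the study.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values of model at x relative to a single baseline.

    Features absent from a coalition take their baseline value
    (a simplifying assumption for this sketch).
    """
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return model(z)

    contributions = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size that exclude i
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                members = set(subset)
                total += weight * (value(members | {i}) - value(members))
        contributions.append(total)
    return contributions


def toy_model(z):
    """Hypothetical model: fewer patients when warmer, more at high pressure."""
    temperature, pressure = z
    return 50.0 - 1.5 * temperature + 0.2 * pressure


x = [5.0, 1020.0]          # a cold, high-pressure day (illustrative values)
baseline = [15.0, 1010.0]  # reference day (illustrative values)
phi = shapley_values(toy_model, x, baseline)
print(dict(zip(["temperature", "pressure"], phi)))
```

In this toy case both contributions are positive: the temperature is below the baseline and the pressure is above it, and each pushes the prediction upward, mirroring the direction of the effects reported above. The contributions also sum to the difference between the prediction at `x` and the prediction at `baseline`, which is the additivity property that gives the method its name.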

Conclusion: We successfully developed models for predicting the occurrence of respiratory diseases using climatic and air-pollution factors. These models could evolve into public warning systems.

Keywords: Air Pollution; Climate; Gaussian Process Regression; Gradient Boosting; Machine Learning; Respiratory Diseases.