Cross-sectional analysis and data-driven forecasting of confirmed COVID-19 cases

Appl Intell (Dordr). 2022;52(3):3303-3318. doi: 10.1007/s10489-021-02616-8. Epub 2021 Jul 5.

Abstract

Coronavirus disease 2019 (COVID-19) is rapidly becoming one of the leading causes of mortality worldwide. Previous works have built various models to study the spread characteristics and trends of the COVID-19 pandemic. Nevertheless, because of limited information and data sources, the understanding of the spread and impact of the pandemic remains restricted. Therefore, this paper takes into account not only daily historical time-series data of COVID-19 but also regional attributes, e.g., geographic and local factors, which may have played an important role in the number of confirmed COVID-19 cases in certain regions. On this basis, the study conducts a comprehensive cross-sectional analysis and data-driven forecasting of the pandemic. The critical features that have a significant influence on the infection rate of COVID-19 are determined by employing the XGB (eXtreme Gradient Boosting) algorithm together with SHAP (SHapley Additive exPlanation), and a comparison is carried out using the RF (Random Forest) and LGB (Light Gradient Boosting) models. To forecast the number of confirmed COVID-19 cases more accurately, a Dual-Stage Attention-Based Recurrent Neural Network (DA-RNN) is applied. This model performs better than SVR (Support Vector Regression) and the encoder-decoder network on the experimental dataset. Model performance is evaluated with three statistical metrics, i.e., MAE, RMSE, and R². This study is expected to serve as a meaningful reference for the control and prevention of the COVID-19 pandemic.

Keywords: Coronavirus disease 2019 (COVID-19); Dual-stage attention-based recurrent neural network (DA-RNN); SHapley additive exPlanation (SHAP); eXtreme gradient boosting (XGB).
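
The abstract names MAE, RMSE, and R² as the evaluation metrics without giving their formulas. As a minimal sketch, assuming the standard textbook definitions (mean absolute error, root mean square error, and coefficient of determination), they could be computed as below. The function names and the case counts are hypothetical and for illustration only; they are not taken from the paper's dataset or code.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error: average magnitude of the forecast errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean square error: penalizes large errors more than MAE."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical daily confirmed-case counts (illustrative, not real data).
observed = [100, 120, 150, 170]
forecast = [110, 115, 145, 180]

print("MAE :", mae(observed, forecast))
print("RMSE:", rmse(observed, forecast))
print("R^2 :", r2(observed, forecast))
```

Lower MAE and RMSE indicate a closer fit, while R² approaches 1 as the forecast explains more of the variance in the observed series.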