An empirical overview of nonlinearity and overfitting in machine learning using COVID-19 data

Chaos Solitons Fractals. 2020 Oct:139:110055. doi: 10.1016/j.chaos.2020.110055. Epub 2020 Jun 30.

Abstract

In this paper, we applied support vector regression to predict the number of COVID-19 cases for the 12 most-affected countries, testing different structures of nonlinearity using kernel functions and analyzing the sensitivity of the models' predictive performance to different hyperparameter settings using 3-D interpolated surfaces. In our experiment, the model that incorporates the highest degree of nonlinearity (Gaussian kernel) had the best in-sample performance but also yielded the worst out-of-sample predictions, a typical example of overfitting in a machine learning model. By contrast, the linear kernel function performed poorly in-sample but generated the best out-of-sample forecasts. The findings of this paper provide an empirical assessment of fundamental concepts in data analysis and highlight the need for caution when applying machine learning models to support real-world decision making, notably with respect to the challenges arising from the COVID-19 pandemic.
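
The contrast between in-sample and out-of-sample behaviour described above can be illustrated with a small, self-contained sketch. It is not the authors' code or data. It assumes a synthetic, noise-free growth series (y = t^2) in place of COVID-19 case counts, and it swaps in kernel ridge regression for support vector regression so that it runs on the Python standard library alone. All function names, the regularization value `lam`, and the kernel width `gamma` are illustrative choices. A real analysis would more likely use an SVR implementation such as scikit-learn's `sklearn.svm.SVR`.

```python
import math

def linear_kernel(a, b):
    # Linear kernel on scalars; the +1 term acts as an intercept.
    return a * b + 1.0

def gaussian_kernel(a, b, gamma=1.0):
    # Gaussian (RBF) kernel; gamma is an assumed width parameter.
    return math.exp(-gamma * (a - b) ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def fit_predict(kernel, x_train, y_train, x_new, lam=1e-3):
    # Kernel ridge regression (a stand-in for SVR):
    # solve (K + lam * I) alpha = y, then predict with sum_i alpha_i k(x, x_i).
    K = [[kernel(a, b) + (lam if i == j else 0.0)
          for j, b in enumerate(x_train)]
         for i, a in enumerate(x_train)]
    alpha = solve(K, y_train)
    return [sum(w * kernel(x, xi) for w, xi in zip(alpha, x_train))
            for x in x_new]

def mae(y_true, y_pred):
    # Mean absolute error.
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Synthetic data (an assumption, not COVID-19 data): train on t = 0..7,
# then forecast t = 10..12, beyond the training range.
x_train = [float(t) for t in range(8)]
y_train = [t ** 2 for t in x_train]
x_test = [10.0, 11.0, 12.0]
y_test = [t ** 2 for t in x_test]

results = {}
for name, kernel in [("linear", linear_kernel), ("gaussian", gaussian_kernel)]:
    in_err = mae(y_train, fit_predict(kernel, x_train, y_train, x_train))
    out_err = mae(y_test, fit_predict(kernel, x_train, y_train, x_test))
    results[name] = (in_err, out_err)
    print(f"{name:8s} in-sample MAE={in_err:8.3f}  out-of-sample MAE={out_err:8.3f}")
```

On this toy series the Gaussian kernel nearly interpolates the training points, but its forecasts collapse toward zero away from them, because every kernel value between a test point and a training point is close to zero. The linear kernel fits the training points less closely but continues the trend. This reproduces the qualitative pattern reported in the abstract, not its numerical results.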

Keywords: Bias-variance dilemma; Epidemic spreading; Hyperparameters and chaos; Statistical learning; Support vector machine; Time series prediction.