Mining Google Trends data for nowcasting and forecasting colorectal cancer (CRC) prevalence

PeerJ Comput Sci. 2023 Oct 4:9:e1518. doi: 10.7717/peerj-cs.1518. eCollection 2023.

Abstract

Background: Colorectal cancer (CRC) is the third most prevalent and second most lethal form of cancer in the world. Consequently, CRC cancer prevalence projections are essential for assessing the future burden of the disease, planning resource allocation, and developing service delivery strategies, as well as for grasping the shifting environment of cancer risk factors. However, unlike cancer incidence and mortality rates, national and international agencies do not routinely issue projections for cancer prevalence. Moreover, the limited or even nonexistent cancer statistics for large portions of the world, along with the high heterogeneity among world nations, further complicate the task of producing timely and accurate CRC prevalence projections. In this situation, population interest, as shown by Internet searches, can be very important for improving cancer statistics and, in the long run, for helping cancer research.

Methods: This study aims to model, nowcast and forecast the CRC prevalence at the global level using a three-step framework that incorporates three well-established univariate statistical and machine-learning models. First, data mining is performed to evaluate the relevancy of Google Trends (GT) data as a surrogate for the number of CRC survivors. The results demonstrate that population web-search interest in the term "colonoscopy" is the most reliable indicator to nowcast CRC disease prevalence. Then, various statistical and machine-learning models, including ARIMA, ETS, and FNNAR, are trained and tested using relevant GT time series. Finally, the updated monthly query series spanning 2004-2022 and the best forecasting model in terms of out-of-sample forecasting ability (i.e., the neural network autoregression) are utilized to generate point forecasts up to 2025.

Results: Results show that the number of people with colorectal cancer will continue to rise over the next 24 months. This in turn emphasizes the urgency for public policies aimed at reducing the population's exposure to the principal modifiable risk factors, such as lifestyle and nutrition. In addition, given the major drop in population interest in CRC during the first wave of the COVID-19 pandemic, the findings suggest that public health authorities should implement measures to increase cancer screening rates during pandemics. This in turn would deliver positive externalities, including the mitigation of the global burden and the enhancement of the quality of official statistics.

Keywords: ARIMA; Automated algorithms; Cancer; Forecast; Google Trends; Neural network autoregression; Nowcast; Public health; Resource allocation; State space.

Grants and funding

The authors received no funding for this work.