Data mining methodology for obtaining epidemiological data in the context of road transport systems

J Ambient Intell Humaniz Comput. 2023;14(7):9253-9275. doi: 10.1007/s12652-022-04427-2. Epub 2022 Oct 1.

Abstract

Millions of people use public transport systems daily, which makes these systems relevant to the epidemiology of respiratory infectious diseases from both a scientific and a health control point of view. This article presents a methodology for obtaining epidemiological information on these diseases in the context of a public road transport system. The information is based on an estimate of the interactions between users of the system that carry a risk of infection. The methodology is novel in its aim since, to the best of our knowledge, no previous study in the context of epidemiology and public transport systems addresses this challenge. The information is obtained by mining the data generated by the trips of users who pay with contactless cards. Data mining therefore underpins the methodology. A strength of the methodology is its comprehensiveness: starting from a formalisation of the problem based on epidemiological concepts and the transport activity itself, it describes and implements all the steps needed to obtain the required epidemiological knowledge. This includes estimating data that are generally unknown in public transport systems but are required to generate the desired results. The outcome is useful epidemiological data based on a complete and reliable description of all estimated potentially infectious interactions between users of the transport system. The methodology can be applied under a variety of initial specifications: epidemiological, temporal and geographic, among others. In addition, the information it provides supports epidemiological studies that involve a large number of people and large samples of interactions collected over long periods of time, which makes comparative studies possible.
Moreover, a real use case is described, in which the methodology is applied to a road transport system that carries around 20 million passengers a year, over a period that predates the COVID-19 pandemic. The results identify the group of users most exposed to infection, which is not the largest group. Finally, it is estimated that applying a seat allocation strategy that minimises the risk of infection reduces that risk by 50%.

Keywords: COVID-19; Contact patterns; Data mining; Intelligent transport systems; Network epidemiology.
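
As an illustration of the kind of trip-data mining the abstract describes, the following is a minimal sketch, not the authors' method. It assumes each contactless-card trip has already been reduced to a card identifier, a vehicle identifier, a boarding time and an estimated alighting time, and it treats two users as having a potentially infectious interaction when they share a vehicle for at least a threshold number of minutes. The `Trip` record, the `risky_interactions` function and the 15-minute default are hypothetical choices made for this example only.

```python
from collections import defaultdict
from itertools import combinations
from typing import NamedTuple


class Trip(NamedTuple):
    """Hypothetical trip record derived from contactless-card data."""
    card_id: str
    vehicle_id: str
    board: int   # boarding time, minutes since midnight
    alight: int  # alighting time, minutes since midnight (assumed already estimated)


def risky_interactions(trips, min_overlap=15):
    """Return (card_a, card_b, overlap_minutes) for every pair of users who
    shared a vehicle for at least `min_overlap` minutes (assumed threshold)."""
    # Group trips by vehicle so only co-travelling users are compared.
    by_vehicle = defaultdict(list)
    for trip in trips:
        by_vehicle[trip.vehicle_id].append(trip)

    interactions = []
    for vehicle_trips in by_vehicle.values():
        for a, b in combinations(vehicle_trips, 2):
            if a.card_id == b.card_id:
                continue  # same card used twice on one vehicle: not an interaction
            # Overlap of the two on-board intervals; negative means no overlap.
            overlap = min(a.alight, b.alight) - max(a.board, b.board)
            if overlap >= min_overlap:
                interactions.append((a.card_id, b.card_id, overlap))
    return interactions


# Made-up sample data for illustration only.
trips = [
    Trip("A", "bus1", 480, 510),
    Trip("B", "bus1", 490, 520),
    Trip("C", "bus1", 505, 515),
    Trip("D", "bus2", 480, 540),
]
result = risky_interactions(trips)
print(result)
```

The pairwise comparison within each vehicle is quadratic in the number of passengers per vehicle, which is acceptable for a sketch; a study covering millions of trips would likely need an interval-sweep or database-side join instead.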