The reproducibility of COVID-19 data analysis: paradoxes, pitfalls, and future challenges

PNAS Nexus. 2022 Aug 23;1(3):pgac125. doi: 10.1093/pnasnexus/pgac125. eCollection 2022 Jul.

Abstract

In the midst of the COVID-19 experience, we learned an important scientific lesson: knowledge acquisition and information quality in medicine depend more on "data quality" than on "data quantity." The large number of COVID-19 reports published in a very short time demonstrated that even the most advanced statistical and computational tools cannot properly compensate for the poor quality of acquired data. The main evidence for this observation comes from the poor reproducibility of results. Indeed, understanding the data generation process is fundamental when investigating scientific questions such as prevalence, immunity, transmissibility, and susceptibility. Most COVID-19 studies are case reports based on nonprobability sampling and do not adhere to the general principles of controlled experimental design. Data collected in this way suffer from many limitations when used to derive clinical conclusions, including confounding factors, measurement errors, and selection bias effects. Each of these elements represents a source of uncertainty that is often ignored or assumed to provide an unbiased random contribution. Inference drawn from large data sets in medicine is also affected by data protection policies that, while protecting patients' privacy, are likely to substantially reduce the usefulness of big data in achieving fundamental goals such as effective and efficient data integration. This limits the generalizability of scientific studies and leads to paradoxical and conflicting conclusions. We provide examples of this from the assessment of the role of risk factors. In conclusion, new paradigms and new design schemes are needed to reach inferential conclusions that are meaningful and informative when dealing with data collected during emergencies like COVID-19.