Major data analysis errors invalidate cancer microbiome findings

Abraham Gihawi; Yuchen Ge; Jennifer Lu; Daniela Puiu; Amanda Xu; Colin S Cooper; Daniel S Brewer; Mihaela Pertea; Steven L Salzberg

doi:10.1101/2023.07.28.550993

bioRxiv [Preprint]. 2023 Jul 31:2023.07.28.550993. doi: 10.1101/2023.07.28.550993.

Authors

Abraham Gihawi¹, Yuchen Ge^{2,3}, Jennifer Lu^{2,3}, Daniela Puiu^{2,3}, Amanda Xu², Colin S Cooper¹, Daniel S Brewer^{1,4}, Mihaela Pertea^{2,3,5}, Steven L Salzberg^{2,3,5,6}

Affiliations

¹ Norwich Medical School, University of East Anglia, Norwich, UK.
² Center for Computational Biology, Johns Hopkins University, Baltimore, Maryland, USA.
³ Department of Biomedical Engineering, Johns Hopkins University, Baltimore, Maryland, USA.
⁴ Earlham Institute, Norwich Research Park, Colney Lane, Norwich, UK.
⁵ Department of Computer Science, Johns Hopkins University, Baltimore, Maryland, USA.
⁶ Department of Biostatistics, Johns Hopkins University, Baltimore, Maryland, USA.

Abstract

We re-analyzed the data from a recent large-scale study that reported strong correlations between microbial organisms and 33 different cancer types, and that created machine learning predictors with near-perfect accuracy at distinguishing among cancers. We found at least two fundamental flaws in the reported data and in the methods: (1) errors in the genome database and the associated computational methods led to millions of false positive findings of bacterial reads across all samples, largely because most of the sequences identified as bacteria were instead human; and (2) errors in transformation of the raw data created an artificial signature, even for microbes with no reads detected, tagging each tumor type with a distinct signal that the machine learning programs then used to create an apparently accurate classifier. Each of these problems invalidates the results, leading to the conclusion that the microbiome-based classifiers for identifying cancer presented in the study are entirely wrong. These flaws have subsequently affected more than a dozen additional published studies that used the same data and whose results are likely invalid as well.
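The second flaw can be illustrated with a toy sketch. It is not the transformation used in the re-analyzed study. It assumes a hypothetical transform, `toy_transform`, that adds an invented per-tumor-type offset to every value, and a simple nearest-centroid rule standing in for the machine learning classifier. The tumor-type labels, offsets, and sample counts are all made up for illustration. The point is only that a microbe with zero reads in every sample can still separate tumor types once such a transform has been applied.

```python
import random

def toy_transform(raw_counts, labels, offsets, noise=0.01, seed=0):
    """Hypothetical group-aware transform (an assumption, not the study's method):
    adds a per-group offset plus a small random perturbation to each value."""
    rng = random.Random(seed)
    return [c + offsets[g] + rng.uniform(-noise, noise)
            for c, g in zip(raw_counts, labels)]

def nearest_centroid_accuracy(values, labels):
    """Assign each sample to the group with the closest mean value and
    return the fraction assigned to its true group (scored on the same data)."""
    groups = sorted(set(labels))
    centroids = {}
    for g in groups:
        members = [v for v, lab in zip(values, labels) if lab == g]
        centroids[g] = sum(members) / len(members)
    predicted = [min(groups, key=lambda g: abs(v - centroids[g])) for v in values]
    correct = sum(p == lab for p, lab in zip(predicted, labels))
    return correct / len(labels)

# Three invented tumor types, 20 samples each; the microbe has zero reads everywhere.
labels = ["A"] * 20 + ["B"] * 20 + ["C"] * 20
raw_counts = [0.0] * len(labels)

# Invented offsets, chosen only to make the effect visible.
offsets = {"A": -1.0, "B": 0.0, "C": 1.0}
transformed = toy_transform(raw_counts, labels, offsets)

raw_acc = nearest_centroid_accuracy(raw_counts, labels)
transformed_acc = nearest_centroid_accuracy(transformed, labels)
print(f"accuracy on raw zeros:        {raw_acc:.2f}")
print(f"accuracy after toy transform: {transformed_acc:.2f}")
```

On the raw zeros the rule can do no better than assigning every sample to one group, while after the toy transform it separates the three groups perfectly, even though the underlying data contain no signal at all.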

Publication types

Preprint
