Contamination in Reference Sequence Databases: Time for Divide-and-Rule Tactics

Valérian Lupo; Mick Van Vlierberghe; Hervé Vanderschuren; Frédéric Kerff; Denis Baurain; Luc Cornet

doi:10.3389/fmicb.2021.755101

Front Microbiol. 2021 Oct 22;12:755101. doi: 10.3389/fmicb.2021.755101. eCollection 2021.

Authors

Valérian Lupo¹,², Mick Van Vlierberghe¹, Hervé Vanderschuren³, Frédéric Kerff², Denis Baurain¹, Luc Cornet¹,³

Affiliations

¹ InBioS-PhytoSYSTEMS, Eukaryotic Phylogenomics, University of Liège, Liège, Belgium.
² InBioS, Center for Protein Engineering, University of Liège, Liège, Belgium.
³ Plant Genetics, TERRA Teaching and Research Center, Gembloux Agro-Bio Tech, University of Liège, Liège, Belgium.

Abstract

Contaminating sequences in public genome databases are a pervasive issue with potentially far-reaching consequences. This problem has attracted much attention in the recent literature, and many different tools are now available to detect contaminants. Although these methods are based on diverse algorithms that can sometimes produce widely different estimates of the contamination level, the majority of genomic studies rely on a single detection method, which carries a risk of systematic error. In this work, we used two orthogonal methods to assess the level of contamination among National Center for Biotechnology Information Reference Sequence Database (RefSeq) bacterial genomes. First, we applied the most popular solution, CheckM, which is based on gene markers. We then complemented this approach with a genome-wide method, termed Physeter, which now implements a k-folds algorithm to avoid inaccurate detection due to potential contamination of the reference database. We demonstrate that CheckM cannot currently be applied to all available genomes and bacterial groups. While it performed well on the majority of RefSeq genomes, it produced dubious results for 12,326 organisms. Among those, Physeter identified 239 contaminated genomes that had been missed by CheckM. In conclusion, we emphasize the importance of using multiple detection methods while providing an upgrade of our own detection tool, Physeter, which minimizes incorrect contamination estimates in the context of unavoidably contaminated reference databases.

Keywords: NCBI RefSeq; assembly; contamination; databases; genomes; phylogenomics; sequencing.

Associated data

figshare/10.6084/m9.figshare.13139810