Toward perfect reads: self-correction of short reads via mapping on de Bruijn graphs

Antoine Limasset; Jean-François Flot; Pierre Peterlongo

doi:10.1093/bioinformatics/btz102

Bioinformatics. 2020 Mar 1;36(5):1374-1381. doi: 10.1093/bioinformatics/btz102.

Authors

Antoine Limasset¹, Jean-François Flot¹,², Pierre Peterlongo³

Affiliations

¹ Evolutionary Biology & Ecology, Université Libre de Bruxelles (ULB), Bruxelles, Belgium.
² Interuniversity Institute of Bioinformatics in Brussels - (IB)², Brussels, Belgium.
³ Inria, CNRS, University of Rennes, IRISA, Rennes, France.

PMID: 30785192
DOI: 10.1093/bioinformatics/btz102

Abstract

Motivation: Short-read accuracy is important for downstream analyses such as genome assembly and hybrid long-read correction. Despite much work on short-read correction, present-day correctors either do not scale well on large datasets or consider reads as mere collections of k-mers, without taking into account their full-length sequence information.

Results: We propose a new method to correct short reads using de Bruijn graphs and implement it as a tool called Bcool. As a first step, Bcool constructs a compacted de Bruijn graph from the reads. This graph is filtered first by k-mer abundance and then by unitig abundance, thereby removing most sequencing errors. The cleaned graph is then used as a reference on which the reads are mapped to correct them. We show that this approach yields more accurate reads than k-mer-spectrum correctors while being scalable to human-size genomic datasets and beyond.
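
To make the pipeline above concrete, here is a minimal Python sketch of the general idea: count k-mers, keep only the abundant ones as graph nodes, then thread each read through the cleaned graph. It is a toy illustration under strong assumptions, not Bcool's implementation: the function names (`count_kmers`, `solid_kmers`, `successors`, `correct_read`) and the `min_abundance` parameter are hypothetical, the graph is neither compacted into unitigs nor filtered by unitig abundance, reverse complements are ignored, and reads are only corrected to the right of their first solid k-mer.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer occurring in the reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def solid_kmers(counts, min_abundance):
    """Keep k-mers seen at least min_abundance times (hypothetical threshold)."""
    return {kmer for kmer, n in counts.items() if n >= min_abundance}

def successors(kmer, nodes):
    """Nodes of the de Bruijn graph that overlap kmer by k-1 bases."""
    return [kmer[1:] + base for base in "ACGT" if kmer[1:] + base in nodes]

def correct_read(read, nodes, k):
    """Thread a read through the cleaned graph, fixing unambiguous mismatches.

    Assumption of this sketch: the walk starts at the first solid k-mer and
    only goes rightwards; it stops at dead ends and at ambiguous branches.
    """
    anchor = next(
        (i for i in range(len(read) - k + 1) if read[i:i + k] in nodes), None
    )
    if anchor is None:
        return read  # nothing to anchor on: leave the read untouched
    corrected = list(read)
    current = read[anchor:anchor + k]
    for pos in range(anchor + k, len(read)):
        candidates = successors(current, nodes)
        wanted = current[1:] + read[pos]
        if wanted in candidates:
            current = wanted              # the read agrees with the graph
        elif len(candidates) == 1:
            current = candidates[0]       # single path: take the graph's base
            corrected[pos] = current[-1]
        else:
            break                         # dead end or branch: stop correcting
    return "".join(corrected)
```

A possible use: with `nodes = solid_kmers(count_kmers(reads, 5), 2)`, calling `correct_read(read, nodes, 5)` on each read returns the read with isolated substitutions replaced by the base found on the graph path. Walking the graph rather than checking k-mers independently is what lets the full read sequence constrain the correction, which is the point the abstract makes against k-mer-spectrum approaches.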

Availability and implementation: The implementation is open source, available at http://github.com/Malfoy/BCOOL under the Affero GPL license and as a Bioconda package.

Supplementary information: Supplementary data are available at Bioinformatics online.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms*
Genome, Human
High-Throughput Nucleotide Sequencing*
Humans
Sequence Analysis, DNA
Software