Repeat-aware evaluation of scaffolding tools

Igor Mandric; Sergey Knyazev; Alex Zelikovsky

doi:10.1093/bioinformatics/bty131


Bioinformatics. 2018 Aug 1;34(15):2530-2537. doi: 10.1093/bioinformatics/bty131.

Authors

Igor Mandric¹, Sergey Knyazev¹, Alex Zelikovsky¹,²

Affiliations

¹ Department of Computer Science, Georgia State University, Atlanta, GA, USA.
² Laboratory of Bioinformatics, I.M. Sechenov First Moscow State Medical University, Moscow, Russia.

Abstract

Summary: Genomic sequences are assembled into a variable but large number of contigs, which should be scaffolded (ordered and oriented) to facilitate comparative or functional analysis. Finding a scaffolding is computationally challenging because of misassemblies, inconsistent coverage across the genome and long repeats. An accurate assessment of scaffolding tools should take into account the multiple locations of the same contig on the reference scaffolding rather than matching a repeat to a single best location. This makes mapping inferred scaffoldings onto the reference a computationally challenging problem. This paper formulates the repeat-aware scaffolding evaluation problem, which is to find a mapping of the inferred scaffolding onto the reference that maximizes the number of correct links, and proposes a scalable algorithm capable of handling large whole-genome datasets. Our novel scaffolding validation framework has been applied to assess most state-of-the-art scaffolding tools on a representative subset of the Genome Assembly Gold-standard Evaluations (GAGE) datasets and on some novel simulated datasets.
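The evaluation problem stated in the summary can be illustrated with a toy sketch. This is not the scalable algorithm the paper proposes; it is an exhaustive search over a deliberately simplified model, and every name in it (best_mapping, first_location_score, count_correct_links) is hypothetical. The assumptions are: the reference is a single ordered list of contig names in which a repeat appears more than once, orientation and gap sizes are ignored, a link is "correct" when two adjacent inferred contigs map to consecutive reference positions, and every inferred contig occurs somewhere in the reference.

```python
from itertools import product


def count_correct_links(scaffolds, assignment):
    """Count adjacent inferred pairs mapped to consecutive reference positions.

    `assignment` holds one reference index per contig, in scaffold order.
    """
    correct = 0
    start = 0
    for scaffold in scaffolds:
        positions = assignment[start:start + len(scaffold)]
        start += len(scaffold)
        correct += sum(1 for a, b in zip(positions, positions[1:]) if b == a + 1)
    return correct


def best_mapping(reference, scaffolds):
    """Exhaustively try every placement of every contig (toy model only).

    Assumes each inferred contig occurs at least once in the reference.
    Returns (maximum number of correct links, the assignment achieving it).
    """
    occurrences = {}
    for index, name in enumerate(reference):
        occurrences.setdefault(name, []).append(index)
    flat = [contig for scaffold in scaffolds for contig in scaffold]
    choices = [occurrences.get(contig, []) for contig in flat]
    best_score, best_assignment = -1, None
    for assignment in product(*choices):
        # Two inferred contigs may not claim the same reference copy.
        if len(set(assignment)) != len(assignment):
            continue
        score = count_correct_links(scaffolds, assignment)
        if score > best_score:
            best_score, best_assignment = score, assignment
    return best_score, best_assignment


def first_location_score(reference, scaffolds):
    """Baseline: map every contig to its first occurrence in the reference."""
    flat = [contig for scaffold in scaffolds for contig in scaffold]
    assignment = tuple(reference.index(contig) for contig in flat)
    return count_correct_links(scaffolds, assignment)


# "R" is a repeat with two copies in the reference.
reference = ["A", "R", "B", "C", "R", "D"]
scaffolds = [["A", "R", "B"], ["C", "R", "D"]]
repeat_aware_score, mapping = best_mapping(reference, scaffolds)
single_location_score = first_location_score(reference, scaffolds)
print(repeat_aware_score, single_location_score)
```

In this example both inferred scaffolds are right, but pinning the repeat R to one location penalizes the second scaffold, while the repeat-aware mapping credits all four links. The exhaustive search grows exponentially with the number of repeat copies, which is why a scalable algorithm is needed for whole-genome data.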

Availability and implementation: The source code of this evaluation framework is available at https://github.com/mandricigor/repeat-aware. The documentation is hosted at https://mandricigor.github.io/repeat-aware.

Supplementary information: Supplementary data are available at Bioinformatics online.

Publication types

Research Support, N.I.H., Extramural
Research Support, Non-U.S. Gov't
Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

Algorithms
Bacteria / genetics
Contig Mapping / methods*
Eukaryota / genetics
Genome*
Genomics / methods
Humans
Repetitive Sequences, Nucleic Acid*
Sequence Analysis, DNA / methods*
Software*

Grants and funding

R01 EB025022/EB/NIBIB NIH HHS/United States