Multi-CSAR: a multiple reference-based contig scaffolder using algebraic rearrangements

BMC Syst Biol. 2018 Dec 31;12(Suppl 9):139. doi: 10.1186/s12918-018-0654-y.

Abstract

Background: One of the important steps in the process of assembling a genome sequence from short reads is scaffolding, in which the contigs in a draft genome are ordered and oriented into scaffolds. Currently, several scaffolding tools based on a single reference genome have been developed. However, a single reference genome may not be sufficient alone for a scaffolder to generate correct scaffolds of a target draft genome, especially when the evolutionary relationship between the target and reference genomes is distant or some rearrangements occur between them. This motivates the need to develop scaffolding tools that can order and orient the contigs of the target genome using multiple reference genomes.

Results: In this work, we utilize a heuristic method to develop a new scaffolder called Multi-CSAR that is able to accurately scaffold a target draft genome based on multiple reference genomes, each of which does not need to be complete. Our experimental results on real datasets show that Multi-CSAR outperforms other two multiple reference-based scaffolding tools, Ragout and MeDuSa, in terms of many average metrics, such as sensitivity, precision, F-score, genome coverage, NGA50, scaffold number and running time.

Conclusions: Multi-CSAR is a multiple reference-based scaffolder that can efficiently produce more accurate scaffolds of a target draft genome by referring to multiple complete and/or incomplete genomes of related organisms. Its stand-alone program is available for download at https://github.com/ablab-nthu/Multi-CSAR.

Keywords: Bioinformatics; Contig; Multiple reference genomes; Scaffolding; Sequencing.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms*
  • Contig Mapping / methods*
  • Heuristics