PanTools: representation, storage and exploration of pan-genomic data

Siavash Sheikhizadeh; M Eric Schranz; Mehmet Akdel; Dick de Ridder; Sandra Smit

doi:10.1093/bioinformatics/btw455

Bioinformatics. 2016 Sep 1;32(17):i487-i493. doi: 10.1093/bioinformatics/btw455.

Authors

Siavash Sheikhizadeh¹, M Eric Schranz², Mehmet Akdel¹, Dick de Ridder¹, Sandra Smit¹

Affiliations

¹ Bioinformatics Group, Wageningen University, Droevendaalsesteeg 1, 6708PB, Wageningen, The Netherlands.
² Biosystematics Group, Wageningen University, Droevendaalsesteeg 1, 6708PB, Wageningen, The Netherlands.

PMID: 27587666

Abstract

Motivation: Next-generation sequencing technology is generating a wealth of highly similar genome sequences for many species, paving the way for a transition from single-genome to pan-genome analyses. Accordingly, genomics research is going to switch from reference-centric to pan-genomic approaches. We define the pan-genome as a comprehensive representation of multiple annotated genomes, facilitating analyses on the similarity and divergence of the constituent genomes at the nucleotide, gene and genome structure level. Current pan-genomic approaches do not thoroughly address scalability, functionality and usability.

Results: We introduce a generalized De Bruijn graph as a pan-genome representation, as well as an online algorithm to construct it. This representation is stored in a Neo4j graph database, which makes our approach scalable to large eukaryotic genomes. Besides the construction algorithm, our software package, called PanTools, currently provides functionality for annotating pan-genomes, adding sequences, grouping genes, retrieving gene sequences or genomic regions, reconstructing genomes and comparing and querying pan-genomes. We demonstrate the performance of the tool using datasets of 62 E. coli genomes, 93 yeast genomes and 19 Arabidopsis thaliana genomes.
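To make the idea of a De Bruijn graph shared across several genomes concrete, the following is a minimal in-memory sketch, not the PanTools construction algorithm. It assumes plain k-mer nodes, ignores reverse complements, does not compress non-branching paths, and does not use Neo4j. All function names (build_pangenome_dbg, shared_kmers, reconstruct) are hypothetical and chosen for illustration only. Each node records the genomes and positions at which its k-mer occurs, which is enough to find sequence shared by all genomes and to rebuild any single genome from the graph.

```python
from collections import defaultdict


def build_pangenome_dbg(genomes, k):
    """Build a toy multi-genome De Bruijn graph.

    genomes: dict mapping a genome name to its sequence string.
    Returns (nodes, edges) where
      nodes[kmer] is a list of (genome, position) occurrences and
      edges[kmer] is the set of k-mers that directly follow it in some genome.
    """
    nodes = defaultdict(list)
    edges = defaultdict(set)
    for name, seq in genomes.items():
        prev = None
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            nodes[kmer].append((name, i))
            if prev is not None:
                # Consecutive k-mers overlap by k-1 bases.
                edges[prev].add(kmer)
            prev = kmer
    return dict(nodes), dict(edges)


def shared_kmers(nodes, n_genomes):
    """Return the k-mers that occur in every one of the n_genomes genomes."""
    return sorted(
        kmer
        for kmer, occ in nodes.items()
        if len({genome for genome, _ in occ}) == n_genomes
    )


def reconstruct(nodes, name):
    """Rebuild one genome from the occurrence lists stored on the nodes."""
    placed = sorted(
        (pos, kmer)
        for kmer, occ in nodes.items()
        for genome, pos in occ
        if genome == name
    )
    if not placed:
        return ""
    # First k-mer in full, then the last base of each following k-mer.
    return placed[0][1] + "".join(kmer[-1] for _, kmer in placed[1:])


if __name__ == "__main__":
    toy = {"g1": "ACGTACGA", "g2": "ACGTTCGA"}  # made-up sequences
    nodes, edges = build_pangenome_dbg(toy, k=3)
    print(shared_kmers(nodes, len(toy)))
    print(reconstruct(nodes, "g1"))
```

Storing occurrences on the nodes is one simple way to keep the graph lossless with respect to its input genomes; a database-backed design would persist the same kind of information as node and edge properties rather than Python dictionaries.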

Availability and implementation: The Java implementation of PanTools is publicly available at http://www.bif.wur.nl

Contact: sandra.smit@wur.nl.

MeSH terms

Algorithms*
Arabidopsis
Computational Biology / methods
Escherichia coli
Genome*
Genome, Bacterial
Genomics
High-Throughput Nucleotide Sequencing*
Humans
Software