NGSPanPipe: A Pipeline for Pan-genome Identification in Microbial Strains from Experimental Reads

Umay Kulsum; Arti Kapil; Harpreet Singh; Punit Kaur

doi:10.1007/978-981-10-7572-8_4

Adv Exp Med Biol. 2018:1052:39-49. doi: 10.1007/978-981-10-7572-8_4.

Authors

Umay Kulsum¹, Arti Kapil², Harpreet Singh³, Punit Kaur⁴

Affiliations

¹ Department of Biophysics, All India Institute of Medical Sciences, New Delhi, India.
² Department of Microbiology, All India Institute of Medical Sciences, New Delhi, India.
³ Indian Council of Medical Research, New Delhi, 110029, India.
⁴ Department of Biophysics, All India Institute of Medical Sciences, New Delhi, India. kaurpunit@gmail.com.

PMID: 29785479

Abstract

Recent advances in sequencing technologies have reduced both the time and the cost of sequencing a whole bacterial genome. High-throughput Next-Generation Sequencing (NGS) technology has generated enormous amounts of data on microbial populations, publicly available across various repositories. As a consequence, it has become possible to study and compare the genomes of different bacterial strains within a species or genus in terms of evolution, ecology and diversity. Studying the pan-genome offers insight into the microevolution, global composition, and diversity in virulence and pathogenesis of a species. It can also assist in identifying drug targets and proposing vaccine candidates. The effective analysis of these large genome datasets necessitates the development of robust tools. Current methods for constructing a pan-genome do not accept raw reads from the sequencer as direct input; they require the reads to be preprocessed into an assembled protein/gene sequence file or a binary matrix of orthologous genes/proteins. We have designed an easy-to-use integrated pipeline, NGSPanPipe, which can identify the pan-genome directly from short reads. The output from the pipeline is compatible with other pan-genome analysis tools. We evaluated our pipeline against other methods for constructing a pan-genome, i.e. reference-based assembly and de novo assembly, using simulated reads of Mycobacterium tuberculosis. The single-script pipeline (pipeline.pl) is applicable to all bacterial strains. It integrates multiple in-house Perl scripts and is freely accessible from https://github.com/Biomedinformatics/NGSPanPipe .
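
To make the idea of a binary matrix of orthologous genes concrete, the following minimal Python sketch shows how a gene presence/absence matrix can be partitioned into core, accessory and pan-genome sets. It is an illustration only and is not taken from NGSPanPipe, which is written in Perl; the function name `partition_pan_genome`, the dictionary layout and the toy gene names are all assumptions made for this example, and the pipeline's actual output format may differ.

```python
from typing import Dict, List, Set, Tuple


def partition_pan_genome(
    presence: Dict[str, List[int]], n_strains: int
) -> Tuple[Set[str], Set[str], Set[str]]:
    """Split genes into (pan, core, accessory) sets.

    `presence` maps a gene name to a row of 0/1 flags, one per strain
    (hypothetical layout, assumed for illustration).
    """
    core: Set[str] = set()
    accessory: Set[str] = set()
    for gene, row in presence.items():
        if len(row) != n_strains:
            raise ValueError(f"{gene}: expected {n_strains} columns, got {len(row)}")
        count = sum(1 for flag in row if flag)
        if count == n_strains:
            core.add(gene)        # present in every strain
        elif count > 0:
            accessory.add(gene)   # present in some, but not all, strains
        # genes absent from every strain belong to neither set
    pan = core | accessory        # pan-genome = core + accessory
    return pan, core, accessory


# Toy matrix for three hypothetical strains (columns) and four genes (rows).
matrix = {
    "geneA": [1, 1, 1],
    "geneB": [1, 0, 1],
    "geneC": [0, 0, 1],
    "geneD": [0, 0, 0],
}

pan, core, accessory = partition_pan_genome(matrix, n_strains=3)
print("core:", sorted(core))
print("accessory:", sorted(accessory))
print("pan:", sorted(pan))
```

In this sketch a gene counts as core only if it is present in every strain; analyses that tolerate missing data sometimes relax this to a fraction of strains, which would be a different design choice from the strict definition used here.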

Keywords: Accessory genome; Bacterial species; Core genome; Next-generation sequencing; Pan-genome; Short reads.

Publication types

Research Support, Non-U.S. Gov't
Review

MeSH terms

Bacteria / classification
Bacteria / genetics*
Bacteria / isolation & purification
Databases, Genetic
Genome, Bacterial*
High-Throughput Nucleotide Sequencing