Retrieving Good-Quality Salmonella Genomes From the GenBank Database Using a Python Tool, SalmoDEST

Bioinform Biol Insights. 2022 Feb 23:16:11779322221080264. doi: 10.1177/11779322221080264. eCollection 2022.

Abstract

With the advent of next-generation whole-genome sequencing (WGS), the need for good-quality and well-characterised Salmonella genomes has increased over the past years. Good-quality complete genomes are often required for assembly reference mapping or phylogenetic single nucleotide polymorphism (SNP) analysis. Complete genomes or contigs from specific sources or serovars are also sought for clustering analysis or source attribution studies. Therefore, new bioinformatics tools are needed to extract good-quality and well-characterised genomes from public databases. Here, we developed SalmoDEST, an open-source Python tool capable of extracting Salmonella genomes with a coverage higher than 50x and a genome length over 4 Mb from the GenBank database in the form of complete genomes or contigs, with verification of the serovar to which they belong and identification of the corresponding multilocus sequence type (MLST) profile. To validate the ability of SalmoDEST to screen for and retrieve genomes of good quality, we compared our results for S. Typhi complete genomes with those available in the literature and extracted Salmonella genomes of strains isolated from bovine sources worldwide. Finally, we provide in this study a list of 239 high-quality complete genomes covering 123 Salmonella serovars. SalmoDEST is a handy and easy-to-use open-source tool for extracting complete genomes or contigs that can be used routinely in public health, food safety and research laboratories. SalmoDEST (SALMOnella Download gEnome Serotype sT) is available at https://github.com/I-Guy/SalmoDEST.

Keywords: MLST profile determination; SalmoDEST; Salmonella; complete reference genomes; good-quality genomes; serovar prediction.
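
The two quality criteria stated in the abstract (coverage higher than 50x and genome length over 4 Mb) can be illustrated with a minimal Python sketch. This is not SalmoDEST's own code: the `GenomeRecord` class, its field names, and the helper functions are hypothetical and serve only to show how such a filter might be written, assuming coverage and length have already been obtained from the assembly metadata.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Thresholds taken from the abstract; both comparisons are strict.
MIN_COVERAGE = 50.0        # fold coverage ("higher than 50x")
MIN_LENGTH_BP = 4_000_000  # genome length in base pairs ("over 4 Mb")


@dataclass(frozen=True)
class GenomeRecord:
    """Hypothetical container for the metadata needed by the filter."""
    accession: str   # placeholder identifier, not a real GenBank accession
    coverage: float  # sequencing depth as a fold value
    length_bp: int   # total assembly length in base pairs


def parse_coverage(text: str) -> float:
    """Convert a coverage string such as '120x' or '85.5X' to a float."""
    return float(text.strip().rstrip("xX"))


def passes_quality(record: GenomeRecord) -> bool:
    """Return True only if the record exceeds both thresholds."""
    return record.coverage > MIN_COVERAGE and record.length_bp > MIN_LENGTH_BP


def filter_good_quality(records: Iterable[GenomeRecord]) -> List[GenomeRecord]:
    """Keep the records that satisfy both quality criteria, in input order."""
    return [r for r in records if passes_quality(r)]
```

Keeping the thresholds as named constants makes the criteria easy to audit or adjust. The serovar verification and MLST profiling steps mentioned in the abstract are outside the scope of this sketch.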