SnakeLines: integrated set of computational pipelines for sequencing reads

Jaroslav Budiš; Werner Krampl; Marcel Kucharík; Rastislav Hekel; Adrián Goga; Jozef Sitarčík; Michal Lichvár; Dávid Smol'ak; Miroslav Böhmer; Andrej Baláž; František Ďuriš; Juraj Gazdarica; Katarína Šoltys; Ján Turňa; Ján Radvánszky; Tomáš Szemes

doi:10.1515/jib-2022-0059

J Integr Bioinform. 2023 Aug 21;20(3):20220059. doi: 10.1515/jib-2022-0059. eCollection 2023 Sep 1.

Authors

Jaroslav Budiš^{1,2,3}, Werner Krampl^{1,3,4}, Marcel Kucharík^{1,3}, Rastislav Hekel^{1,2,4}, Adrián Goga^{3,5}, Jozef Sitarčík^{1,2,3}, Michal Lichvár^{1,3}, Dávid Smol'ak^{1,4}, Miroslav Böhmer^{1,3,4}, Andrej Baláž^{1,6}, František Ďuriš^{1,2}, Juraj Gazdarica^{1,2}, Katarína Šoltys^{3,4}, Ján Turňa^{2,3,4}, Ján Radvánszky^{1,3,7}, Tomáš Szemes^{1,3,4}

Affiliations

¹ Geneton Ltd., 841 04 Bratislava, Slovakia.
² Slovak Centre of Scientific and Technical Information, 811 04 Bratislava, Slovakia.
³ Comenius University Science Park, 841 04 Bratislava, Slovakia.
⁴ Department of Molecular Biology, Faculty of Natural Sciences, Comenius University, 841 04 Bratislava, Slovakia.
⁵ Department of Computer Science, Faculty of Mathematics, Physics and Informatics, Comenius University, 841 04 Bratislava, Slovakia.
⁶ Department of Applied Informatics, Faculty of Mathematics, Physics and Informatics, Comenius University, 841 04 Bratislava, Slovakia.
⁷ Institute of Clinical and Translational Research, Biomedical Research Center, Slovak Academy of Sciences, 845 05 Bratislava, Slovakia.

Abstract

With the rapid growth of massively parallel sequencing technologies, more and more laboratories are utilising sequenced DNA fragments for genomic analyses. Interpretation of sequencing data is, however, strongly dependent on bioinformatics processing, which is often too demanding for clinicians and researchers without a computational background. Another problem is the reproducibility of computational analyses across separate computational centres with inconsistent versions of installed libraries and bioinformatics tools. We propose an easily extensible set of computational pipelines, called SnakeLines, for processing sequencing reads, including mapping, assembly, variant calling, viral identification, transcriptomics, and metagenomics analysis. Individual steps of an analysis, along with the methods and their parameters, can be readily modified in a single configuration file. The provided pipelines are embedded in virtual environments that ensure isolation of the required resources from the host operating system, rapid deployment, and reproducibility of analyses across different Unix-based platforms. SnakeLines is a powerful framework for the automation of bioinformatics analyses, with emphasis on simple set-up, modification, extensibility, and reproducibility. The framework is already routinely used in various research projects and their applications, most notably in the Slovak national surveillance of SARS-CoV-2.
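
The idea of steering a whole analysis from a single configuration file can be illustrated with a minimal Python sketch. It is not the SnakeLines configuration schema or code: the step names, the tool assigned to each step, the parameter names, and the helper functions `build_plan` and `swap_method` are all assumptions made for illustration.

```python
# Hypothetical configuration: each analysis step names a method and its
# parameters. The keys and values are illustrative assumptions and do not
# reproduce the actual SnakeLines configuration format.
CONFIG = {
    "trimming": {"method": "trimmomatic", "params": {"min_length": 35}},
    "mapping": {"method": "bowtie2", "params": {"threads": 4}},
    "variant_calling": {"method": "bcftools", "params": {"min_quality": 20}},
}


def build_plan(config):
    """Turn the configuration into an ordered list of step descriptions."""
    plan = []
    for step, settings in config.items():
        params = ", ".join(
            f"{key}={value}" for key, value in sorted(settings["params"].items())
        )
        plan.append(f"{step}: {settings['method']} ({params})")
    return plan


def swap_method(config, step, method, params):
    """Return a new configuration with one step's method replaced.

    The original configuration is left untouched, so two variants of an
    analysis can be compared side by side.
    """
    updated = dict(config)
    updated[step] = {"method": method, "params": dict(params)}
    return updated


if __name__ == "__main__":
    for line in build_plan(CONFIG):
        print(line)
```

In this sketch, changing the mapper means editing one entry, for example `swap_method(CONFIG, "mapping", "bwa", {"threads": 8})`, while the remaining steps stay as they were.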

Keywords: computational pipeline; framework; massively parallel sequencing; reproducibility; virtual environment.

MeSH terms

Computational Biology / methods
Genomics* / methods
High-Throughput Nucleotide Sequencing / methods
Reproducibility of Results
Software*