Software pipelines for RNA-Seq, ChIP-Seq and germline variant calling analyses in common workflow language (CWL)

Front Bioinform. 2023 Nov 7:3:1275593. doi: 10.3389/fbinf.2023.1275593. eCollection 2023.

Abstract

Background: Automating data analysis pipelines is a key requirement to ensure reproducibility of results, especially when dealing with large volumes of data. Here we assembled automated pipelines for the analysis of High-throughput Sequencing (HTS) data originating from RNA-Seq, ChIP-Seq and Germline variant calling experiments. We implemented these workflows in Common workflow language (CWL) and evaluated their performance by: i) reproducing the results of two previously published studies on Chronic Lymphocytic Leukemia (CLL), and ii) analyzing whole genome sequencing data from four Genome in a Bottle Consortium (GIAB) samples, comparing the detected variants against their respective golden standard truth sets. Findings: We demonstrated that CWL-implemented workflows clearly achieved high accuracy in reproducing previously published results, discovering significant biomarkers and detecting germline SNP and small INDEL variants. Conclusion: CWL pipelines are characterized by reproducibility and reusability; combined with containerization, they provide the ability to overcome issues of software incompatibility and laborious configuration requirements. In addition, they are flexible and can be used immediately or adapted to the specific needs of an experiment or study. The CWL-based workflows developed in this study, along with version information for all software tools, are publicly available on GitHub (https://github.com/BiodataAnalysisGroup/CWL_HTS_pipelines) under the MIT License. They are suitable for the analysis of short-read (such as Illumina-based) data and constitute an open resource that can facilitate automation, reproducibility and cross-platform compatibility for standard bioinformatic analyses.

Keywords: containers; epigenomics; genomics; reproducibility; reusability; transcriptomics; workflow automation; workflow language.

Grants and funding

The author(s) declare financial support was received for the research, authorship, and/or publication of this article. Funding from the European Regional Development Fund of the European Union and Greek national funds through the Operational Program Competitiveness, Entrepreneurship and Innovation, under the call RESEARCH–CREATE–INNOVATE (GenOptics, project code: T2E1DK-00407). This work was supported by ELIXIR-GR HYPATIA cloud infrastructure. This work was also supported by ELIXIR, the research infrastructure for life-science data.