A modular metagenomics analysis system for integrated multi-step data exploration

bioRxiv [Preprint]. 2023 Apr 9:2023.04.09.536171. doi: 10.1101/2023.04.09.536171.

Abstract

Motivation: Computational analysis of large-scale metagenomics sequencing datasets has proved incredibly valuable for extracting isolate-level taxonomic and functional insights from complex microbial communities. However, owing to an ever-expanding ecosystem of metagenomics-specific algorithms and file formats, designing studies, implementing seamless and scalable end-to-end workflows, and exploring the massive amounts of output data have become studies unto themselves. Furthermore, there is little interoperability between the outputs of analyses with different purposes, such as short-read classification and metagenome-assembled genome (MAG) reconstruction. One-click pipelines have helped to organize these tools into targeted workflows, but they suffer from general compatibility and maintainability issues.

Results: To address the gap in easily extensible yet robustly distributable metagenomics workflows, we have developed a module-based metagenomics analysis system written in Snakemake, a popular workflow management system, along with a standardized module and working directory architecture. Each module can be run independently or chained with a series of others to produce the target data format (e.g., short-read preprocessing alone, or short-read preprocessing followed by de novo assembly), and outputs aggregated summary statistics reports and semi-guided Jupyter notebook-based visualizations. The module system is a bioinformatics-optimized scaffold designed to be rapidly iterated upon by the research community at large.
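
As a minimal illustration of the module-chaining idea, the sketch below shows how a short-read preprocessing step and a de novo assembly step might be composed in Snakemake. The rule names, working directory layout, and tool invocations (fastp, metaSPAdes) are hypothetical stand-ins chosen for illustration and do not reflect the actual CAMP module implementations, which are available at the repository listed under Availability.

    # Hypothetical Snakemake sketch of chaining two module-style steps:
    # short-read preprocessing followed by de novo assembly.
    # Paths, rule names, and tools are illustrative assumptions only.

    SAMPLES = ["sample_A", "sample_B"]

    rule all:
        input:
            expand("workdir/assembly/{sample}/contigs.fasta", sample=SAMPLES)

    # Step 1: short-read preprocessing (adapter/quality trimming with fastp)
    rule preprocess_reads:
        input:
            r1="workdir/raw/{sample}_R1.fastq.gz",
            r2="workdir/raw/{sample}_R2.fastq.gz",
        output:
            r1="workdir/preprocessing/{sample}_R1.trimmed.fastq.gz",
            r2="workdir/preprocessing/{sample}_R2.trimmed.fastq.gz",
        shell:
            "fastp -i {input.r1} -I {input.r2} "
            "-o {output.r1} -O {output.r2}"

    # Step 2: de novo assembly, consuming step 1's standardized outputs
    rule assemble:
        input:
            r1=rules.preprocess_reads.output.r1,
            r2=rules.preprocess_reads.output.r2,
        output:
            "workdir/assembly/{sample}/contigs.fasta",
        shell:
            "metaspades.py -1 {input.r1} -2 {input.r2} "
            "-o workdir/assembly/{wildcards.sample}"

Because each rule reads from and writes to a standardized working directory, the assembly step can be dropped or replaced without touching the preprocessing step, which is the kind of independent-or-chained execution the module system is designed around.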

Availability: The module template as well as the modules described below can be found at https://github.com/MetaSUB-CAMP.

Publication types

  • Preprint