A modular metagenomics analysis system for integrated multi-step data exploration

bioRxiv [Preprint]. 2023 Apr 9:2023.04.09.536171. doi: 10.1101/2023.04.09.536171.

Abstract

Motivation: Computational analysis of large-scale metagenomics sequencing datasets has proved incredibly valuable for extracting isolate-level taxonomic and functional insights from complex microbial communities. However, owing to an ever-expanding ecosystem of metagenomics-specific algorithms and file formats, designing studies, implementing seamless and scalable end-to-end workflows, and exploring the massive amounts of output data have become studies unto themselves. Furthermore, there is little interoperability between the outputs of analyses with different purposes, such as short-read classification and metagenome-assembled genome (MAG) reconstruction. One-click pipelines have helped to organize these tools into targeted workflows, but they suffer from general compatibility and maintainability issues.

Results: To address the gap in easily extensible yet robustly distributable metagenomics workflows, we have developed a module-based metagenomics analysis system written in Snakemake, a popular workflow management system, along with a standardized module and working directory architecture. Each module can be run independently or chained with a series of others to produce the target data format (e.g., short-read preprocessing alone, or short-read preprocessing followed by de novo assembly), and outputs aggregated summary statistics reports and semi-guided Jupyter notebook-based visualizations. The module system is a bioinformatics-optimized scaffold designed to be rapidly iterated upon by the research community at large.
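
As a minimal illustration of the module-chaining idea, the sketch below shows how a short-read preprocessing step and a de novo assembly step might be composed in Snakemake. The rule names, working directory layout, and tool invocations (fastp, metaSPAdes) are hypothetical stand-ins chosen for illustration and do not reflect the actual CAMP module implementations, which are available at the repository listed under Availability.

    # Hypothetical Snakemake sketch of chaining two module-style steps:
    # short-read preprocessing followed by de novo assembly.
    # Paths, rule names, and tools are illustrative assumptions only.

    SAMPLES = ["sample_A", "sample_B"]

    rule all:
        input:
            expand("workdir/assembly/{sample}/contigs.fasta", sample=SAMPLES)

    # Step 1: short-read preprocessing (adapter/quality trimming with fastp)
    rule preprocess_reads:
        input:
            r1="workdir/raw/{sample}_R1.fastq.gz",
            r2="workdir/raw/{sample}_R2.fastq.gz",
        output:
            r1="workdir/preprocessing/{sample}_R1.trimmed.fastq.gz",
            r2="workdir/preprocessing/{sample}_R2.trimmed.fastq.gz",
        shell:
            "fastp -i {input.r1} -I {input.r2} "
            "-o {output.r1} -O {output.r2}"

    # Step 2: de novo assembly, consuming step 1's standardized outputs
    rule assemble:
        input:
            r1=rules.preprocess_reads.output.r1,
            r2=rules.preprocess_reads.output.r2,
        output:
            "workdir/assembly/{sample}/contigs.fasta",
        shell:
            "metaspades.py -1 {input.r1} -2 {input.r2} "
            "-o workdir/assembly/{wildcards.sample}"

Because each rule reads from and writes to a standardized working directory, the assembly step can be dropped or replaced without touching the preprocessing step, which is the kind of independent-or-chained execution the module system is designed around.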

Availability: The module template as well as the modules described below can be found at https://github.com/MetaSUB-CAMP.

Publication types

  • Preprint