Swapping Metagenomics Preprocessing Pipeline Components Offers Speed and Sensitivity Increases

George Armstrong; Cameron Martino; Justin Morris; Behnam Khaleghi; Jaeyoung Kang; Jeff DeReus; Qiyun Zhu; Daniel Roush; Daniel McDonald; Antonio Gonzalez; Justin P Shaffer; Carolina Carpenter; Mehrbod Estaki; Stephen Wandro; Sean Eilert; Ameen Akel; Justin Eno; Ken Curewitz; Austin D Swafford; Niema Moshiri; Tajana Rosing; Rob Knight

doi:10.1128/msystems.01378-21

mSystems. 2022 Apr 26;7(2):e0137821. doi: 10.1128/msystems.01378-21. Epub 2022 Mar 16.

Authors

George Armstrong^#^{1,2}, Cameron Martino^#^{1,2,3}, Justin Morris^{4,5}, Behnam Khaleghi⁶, Jaeyoung Kang⁵, Jeff DeReus^{1,3}, Qiyun Zhu^{7,8}, Daniel Roush^{7,8}, Daniel McDonald¹, Antonio Gonzalez¹, Justin P Shaffer¹, Carolina Carpenter^{3,9}, Mehrbod Estaki¹, Stephen Wandro³, Sean Eilert¹⁰, Ameen Akel¹⁰, Justin Eno¹⁰, Ken Curewitz¹⁰, Austin D Swafford³, Niema Moshiri⁶, Tajana Rosing^{3,5,6}, Rob Knight^{1,6,11}

Affiliations

¹ Department of Pediatrics, School of Medicine, University of California, San Diego, California, USA.
² Bioinformatics and Systems Biology Program, University of California, San Diego, California, USA.
³ Center for Microbiome Innovation, Jacobs School of Engineering, University of California San Diego, La Jolla, California, USA.
⁴ Department of Electrical and Computer Engineering, San Diego State University, San Diego, California, USA.
⁵ Department of Electrical and Computer Engineering, Jacobs School of Engineering, University of California San Diego, La Jolla, California, USA.
⁶ Department of Computer Science and Engineering, Jacobs School of Engineering, University of California San Diego, La Jolla, California, USA.
⁷ School of Life Sciences, Arizona State University, Tempe, Arizona, USA.
⁸ Biodesign Center for Fundamental and Applied Microbiomics, Arizona State University, Tempe, Arizona, USA.
⁹ Scripps Institution of Oceanography, University of California San Diego, La Jolla, California, USA.
¹⁰ Micron Technology, Inc., Folsom, California, USA.
¹¹ Department of Bioengineering, University of California, San Diego, La Jolla, California, USA.

^# Contributed equally.

Abstract

Increasing data volumes on high-throughput sequencing instruments such as the NovaSeq 6000 lead to long computational bottlenecks for common metagenomics data preprocessing tasks such as adaptor and primer trimming and host removal. Here, we test whether faster, recently developed computational tools (Fastp and Minimap2) can replace widely used choices (Atropos and Bowtie2), obtaining dramatic accelerations with additional sensitivity and minimal loss of specificity for these tasks. Furthermore, the taxonomic tables resulting from downstream processing provide biologically comparable results. However, we demonstrate that for taxonomic assignment, Bowtie2's specificity is still required. We suggest that periodic reevaluation of pipeline components, together with improvements to standardized APIs to chain them together, will greatly enhance the efficiency of common bioinformatics tasks while also facilitating incorporation of further optimized steps running on GPUs, FPGAs, or other architectures. We also note that a detailed exploration of available algorithms and pipeline components is an important step that should be taken before optimization of less efficient algorithms on advanced or nonstandard hardware.

IMPORTANCE In shotgun metagenomics studies that seek to relate changes in microbial DNA across samples, processing the data on a computer often takes longer than obtaining the data from the sequencing instrument. Recently developed software packages that perform individual steps in the data processing pipeline offer speed advantages in principle, but in practice they may contain pitfalls that prevent their use; for example, they may make approximations that introduce unacceptable errors in the data. Here, we show that differences in choices of these components can speed up overall data processing by 5-fold or more on the same hardware while maintaining a high degree of correctness, greatly reducing the time taken to interpret results. This is an important step for using the data in clinical settings, where the time taken to obtain the results may be critical for guiding treatment.
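
The component swap described in the abstract (Fastp for trimming, Minimap2 for host removal) can be illustrated with a minimal sketch that only assembles a shell pipeline string and does not run it. The abstract does not state the parameters used in the study, so the options below are illustrative assumptions drawn from each tool's documented command-line interface, and the function name and all file names are hypothetical placeholders.

```python
import shlex


def build_host_filter_pipeline(r1, r2, host_index, out_r1, out_r2, threads=4):
    """Return a shell pipeline string: fastp -> minimap2 -> samtools fastq.

    Assumption: paired-end input and a prebuilt host reference (FASTA or .mmi).
    The options are illustrative and are not the parameters from the study.
    """
    q = shlex.quote
    t = int(threads)
    # fastp: trim paired-end reads and stream interleaved FASTQ to stdout.
    fastp = f"fastp -i {q(r1)} -I {q(r2)} --stdout -w {t}"
    # minimap2: short-read preset (-x sr), SAM output (-a), reads from stdin.
    minimap2 = f"minimap2 -ax sr -t {t} {q(host_index)} -"
    # samtools fastq: keep pairs with both mates unmapped (-f 12) and
    # skip secondary alignments (-F 256), i.e. reads not matching the host.
    samtools = f"samtools fastq -f 12 -F 256 -1 {q(out_r1)} -2 {q(out_r2)} -"
    return " | ".join([fastp, minimap2, samtools])


if __name__ == "__main__":
    print(build_host_filter_pipeline(
        "sample_R1.fastq.gz", "sample_R2.fastq.gz", "host.mmi",
        "filtered_R1.fastq", "filtered_R2.fastq",
    ))
```

Building the command as a string keeps the sketch runnable without the tools installed; passing the result to a shell would require fastp, minimap2, and samtools on the PATH. Because the abstract reports that Bowtie2's specificity is still required for taxonomic assignment, a sketch like this would cover only the preprocessing stage.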

Keywords: alignment; host filtering; metagenomics.

Publication types

Research Support, U.S. Gov't, Non-P.H.S.
Research Support, N.I.H., Extramural

MeSH terms

Algorithms
Computational Biology / methods
High-Throughput Nucleotide Sequencing / methods
Metagenomics* / methods
Software*

Grants and funding

K12 GM068524/GM/NIGMS NIH HHS/United States