SAMSA: a comprehensive metatranscriptome analysis pipeline

Samuel T Westreich; Ian Korf; David A Mills; Danielle G Lemay

doi:10.1186/s12859-016-1270-8

BMC Bioinformatics. 2016 Sep 29;17(1):399. doi: 10.1186/s12859-016-1270-8.

Authors

Samuel T Westreich¹,², Ian Korf¹,², David A Mills³, Danielle G Lemay⁴

Affiliations

¹ Department of Molecular and Cellular Biology, University of California - Davis, Davis, CA, USA.
² Genome Center, University of California - Davis, Davis, CA, USA.
³ Department of Food Science and Technology, University of California - Davis, Davis, CA, USA.
⁴ Genome Center, University of California - Davis, Davis, CA, USA. dglemay@ucdavis.edu.

Abstract

Background: Although metatranscriptomics (the study of diverse microbial population activity based on RNA-seq data) is rapidly growing in popularity, there are limited options for biologists to analyze this type of data. Current approaches for processing metatranscriptomes rely either on restricted databases and a dedicated computing cluster, or on metagenome-based approaches that have not been fully evaluated for processing metatranscriptomic datasets. We created a new bioinformatics pipeline, designed specifically for metatranscriptome dataset analysis, which runs in conjunction with Metagenome-RAST (MG-RAST) servers. Designed for use by researchers with relatively little bioinformatics experience, SAMSA offers a breakdown of metatranscriptome transcription activity levels by organism or transcript function, and is fully open source. We used this new tool to evaluate best practices for sequencing stool metatranscriptomes.

Results: Working with the MG-RAST annotation server, we constructed the Simple Annotation of Metatranscriptomes by Sequence Analysis (SAMSA) software package, a complete pipeline for the analysis of gut microbiome data. SAMSA can summarize and evaluate raw annotation results, identifying abundant species and significant functional differences between metatranscriptomes. Using pilot data and simulated subsets, we determined experimental requirements for fecal gut metatranscriptomes. Sequences need to be either long reads (longer than 100 bp) or joined paired-end reads. Each sample needs 40-50 million raw sequences, which can be expected to yield the 5-10 million annotated reads necessary for accurate abundance measures. We also demonstrated that ribosomal RNA depletion does not equally deplete ribosomes from all species within a sample, and remaining rRNA sequences should be discarded. Using publicly available metatranscriptome data in which rRNA was not depleted, we were able to demonstrate that overall organism transcriptional activity can be measured using mRNA counts. We were also able to detect significant differences between control and experimental groups in both organism transcriptional activity and specific cellular functions.
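The sequencing requirements stated above can be expressed as a simple pre-flight check. The following is a minimal illustrative sketch, not part of the SAMSA package: the `SampleStats` class, the `check_sample` function, and the constant names are all hypothetical, and only the numeric thresholds (reads longer than 100 bp or joined paired-end reads, at least 40 million raw sequences, at least 5 million annotated reads) are taken from the text.

```python
from dataclasses import dataclass
from typing import List

# Thresholds taken from the abstract; the names are hypothetical.
MIN_RAW_READS = 40_000_000        # lower bound of the 40-50 million raw sequences
MIN_ANNOTATED_READS = 5_000_000   # lower bound of the 5-10 million annotated reads
LONG_READ_BP = 100                # "long reads" are longer than 100 bp


@dataclass
class SampleStats:
    """Hypothetical per-sample summary; not a SAMSA data structure."""
    raw_reads: int
    annotated_reads: int
    read_length_bp: int
    paired_end_joined: bool


def check_sample(stats: SampleStats) -> List[str]:
    """Return a list of human-readable problems; empty if all checks pass."""
    problems = []
    if stats.raw_reads < MIN_RAW_READS:
        problems.append("fewer than 40 million raw sequences")
    if stats.annotated_reads < MIN_ANNOTATED_READS:
        problems.append("fewer than 5 million annotated reads")
    # Reads must be either long (> 100 bp) or joined paired-end reads.
    if not (stats.read_length_bp > LONG_READ_BP or stats.paired_end_joined):
        problems.append("reads are neither longer than 100 bp nor joined paired-end")
    return problems


if __name__ == "__main__":
    sample = SampleStats(raw_reads=45_000_000, annotated_reads=7_000_000,
                         read_length_bp=100, paired_end_joined=True)
    print(check_sample(sample))
```

A sample that fails any check returns one message per unmet requirement, which makes it easy to flag under-sequenced samples before submitting them for annotation.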

Conclusions: By making this new pipeline publicly available, we have created a powerful new tool for metatranscriptomics research, offering a new method for greater insight into the activity of diverse microbial communities. We further recommend that stool metatranscriptomes be ribodepleted and sequenced in a 100 bp paired-end format with a minimum of 40 million reads per sample.

Keywords: Best practices; Big data; Metagenome; Metatranscriptome; Microbiome; Pipeline; RNA-seq; Software package.
