Nontargeted virus sequence discovery pipeline and virus clustering for metagenomic data

David Paez-Espino; Georgios A Pavlopoulos; Natalia N Ivanova; Nikos C Kyrpides

doi:10.1038/nprot.2017.063

Nat Protoc. 2017 Aug;12(8):1673-1682. doi: 10.1038/nprot.2017.063. Epub 2017 Jul 27.

Authors

David Paez-Espino¹, Georgios A Pavlopoulos¹, Natalia N Ivanova¹, Nikos C Kyrpides¹

Affiliation

¹ Joint Genome Institute, Department of Energy, Walnut Creek, California, USA.

PMID: 28749930

Abstract

The analysis of large microbiome data sets holds great promise for the delineation of the biological and metabolic functioning of living organisms and their role in the environment. In the midst of this genomic puzzle, viruses, especially those that infect microbial communities, represent a major reservoir of genetic diversity with great impact on biogeochemical cycles and organismal health. Overcoming the limitations associated with virus detection directly from microbiomes can provide key insights into how ecosystem dynamics are modulated. Here, we present a computational protocol for accurate detection and grouping of viral sequences from microbiome samples. Our approach relies on an expanded and curated set of viral protein families used as bait to identify viral sequences directly from metagenomic assemblies. This protocol describes how to use the viral protein families catalog (∼7 h) and recommended filters for the detection of viral contigs in metagenomic samples (∼6 h), and it describes the specific parameters for a nucleotide-sequence-identity-based method of organizing the viral sequences into quasi-species taxonomic-level groups (∼10 min).
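The final step, grouping viral sequences by nucleotide sequence identity, can be illustrated with a minimal single-linkage sketch. This is not the authors' implementation: the abstract does not give the identity or alignment cutoffs, so the defaults below are placeholder assumptions, and the function name `cluster_contigs` and the hit tuple layout are hypothetical. The input is assumed to be pairwise alignment results already computed by some aligner.

```python
from collections import defaultdict


def cluster_contigs(contig_ids, hits, min_identity=90.0, min_aligned_fraction=0.75):
    """Group contigs by single linkage over pairwise alignment hits.

    contig_ids: iterable of contig names (every name used in `hits` must be listed).
    hits: iterable of (contig_a, contig_b, percent_identity, aligned_fraction).
    The two thresholds are placeholder assumptions, not values from the abstract.
    """
    contig_ids = list(contig_ids)
    parent = {c: c for c in contig_ids}  # union-find forest, one tree per group

    def find(x):
        # Walk to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, identity, aligned_fraction in hits:
        # Link two contigs only if the hit passes both cutoffs.
        if identity >= min_identity and aligned_fraction >= min_aligned_fraction:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    groups = defaultdict(list)
    for c in contig_ids:
        groups[find(c)].append(c)

    # Largest groups first; ties broken alphabetically for stable output.
    return sorted((sorted(g) for g in groups.values()), key=lambda g: (-len(g), g))


if __name__ == "__main__":
    contigs = ["c1", "c2", "c3", "c4"]
    pairwise_hits = [
        ("c1", "c2", 95.0, 0.90),
        ("c2", "c3", 92.0, 0.80),
        ("c3", "c4", 85.0, 0.90),  # fails the identity cutoff, so c4 stays alone
    ]
    for group in cluster_contigs(contigs, pairwise_hits):
        print(group)
```

Single linkage means that c1 and c3 end up in the same group through c2 even without a direct hit between them. Contigs with no qualifying hit are kept as singleton groups.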

MeSH terms

Cluster Analysis
Computational Biology / methods*
Metagenomics / methods*
Viruses / classification*
Viruses / genetics*
Viruses / isolation & purification