Construction of a Comprehensive Database from the Existing Viral Sequences Available from the International Nucleotide Sequence Database Collaboration

Methods Mol Biol. 2018:1838:231-243. doi: 10.1007/978-1-4939-8682-8_16.

Abstract

The progress in viromics research has led to the accumulation of a large number of sequences from different types of viruses obtained from different sources. Most databases are specific to particular species or types of viruses. However, raw sequences, as deposited in reliable online collections, provide a valuable asset in the exploration of genomic and metagenomic datasets.

The International Nucleotide Sequence Database Collaboration (INSDC) is the largest coordinated effort for compiling, sharing, and maintaining the most comprehensive collections of nucleic acids deposited throughout the most important public databases. The compendium includes different types of data such as complete genomes, genes, expressed sequence tags, and data generated by whole genome shotgun analyses spanning all domains of life, as well as the most complete collection of viral sequences available online.

This chapter presents simplified computational methods for automating the retrieval of viral nucleotide sequences from the online repositories of the INSDC databases, including all available sequences except synthetic ones. The subsequent steps can be used to obtain the taxonomy (including the ranks virus type, Baltimore classification, order, family, subfamily, genus, and species) and to split the database into species subsets in order to dereplicate the sequences for other downstream applications. Only basic computational knowledge is required.
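The last two steps named above (splitting by species and dereplicating) can be illustrated with a minimal Python sketch. It is not the chapter's own pipeline: it assumes the sequences are already downloaded as FASTA text, that a hypothetical `species_of` mapping from accession to species name has been built beforehand, and that "dereplicate" means dropping exact duplicate sequences within a species, which may be stricter or looser than the criterion the authors use. The function names and the toy records are illustrative only.

```python
import hashlib
from collections import defaultdict


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_and_dereplicate(fasta_text, species_of):
    """Group records by species, keeping one record per distinct sequence.

    species_of is an assumed, pre-built dict: accession -> species name.
    Records whose accession is missing from it go to "unclassified".
    """
    # species -> {sequence digest: (accession, sequence)}
    subsets = defaultdict(dict)
    for header, seq in parse_fasta(fasta_text):
        accession = header.split()[0]
        species = species_of.get(accession, "unclassified")
        # Case-insensitive exact-match key; the first record seen is kept.
        digest = hashlib.sha256(seq.upper().encode("ascii")).hexdigest()
        subsets[species].setdefault(digest, (accession, seq))
    return {sp: list(recs.values()) for sp, recs in subsets.items()}


if __name__ == "__main__":
    # Toy records with made-up accessions, for illustration only.
    toy = ">ACC1 example\nACGT\nAC\n>ACC2 example\nacgtac\n>ACC3\nGGGG\n"
    mapping = {"ACC1": "Species A", "ACC2": "Species A"}
    for species, records in split_and_dereplicate(toy, mapping).items():
        print(species, [acc for acc, _ in records])
```

Hashing the sequence rather than storing it as a dictionary key keeps memory use modest for long genomes. Near-identical sequences are not merged here; that would need a clustering tool rather than exact matching.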

Keywords: Computational methods; Databases; Genes; Genomes; Taxonomy; Viruses.

MeSH terms

  • Computational Biology / methods*
  • Databases, Nucleic Acid*
  • Genome, Viral*
  • Metagenome*
  • Metagenomics* / methods
  • Software
  • Viruses / genetics*