VHost-Classifier: virus-host classification using natural language processing

Bioinformatics. 2019 Oct 1;35(19):3867-3869. doi: 10.1093/bioinformatics/btz151.

Abstract

Motivation: When analyzing viral metagenomic sequences, it is often desirable to filter the results of a BLAST analysis by the host species of the virus. VHost-Classifier automates this procedure using a natural language processing algorithm written in Python 3. It takes as input a list of taxonomic identifiers (taxids) returned from a BLAST query of viral sequences. The taxids are binned by the evolutionary lineage of their hosts, which is inferred by string matching the words in the viruses' English names. If VHost-Classifier cannot identify a host, it attempts to bin the sequences by the environment from which the sample originated. Because VHost-Classifier predicts the evolutionary lineage of the host from the virus name and does not rely on referencing taxids against a database, it is not constrained by the size of a database and can assign hosts to newly characterized viruses.
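The name-based binning described above can be illustrated with a minimal sketch. This is not the VHost-Classifier implementation: the keyword tables, the function names `classify` and `bin_by_host`, and the placeholder taxids are all hypothetical, and the sketch assumes the English virus names have already been looked up for each taxid.

```python
import re
from collections import defaultdict

# Hypothetical keyword table: English host word -> (Class, Family or None).
# A real tool would need a far larger vocabulary.
HOST_KEYWORDS = {
    "bat": ("Mammalia", None),
    "porcine": ("Mammalia", "Suidae"),
    "avian": ("Aves", None),
    "tomato": ("Magnoliopsida", "Solanaceae"),
}

# Hypothetical environment words used as a fallback when no host word matches.
ENVIRONMENT_KEYWORDS = {"sewage", "marine", "soil"}


def classify(virus_name):
    """Classify one virus name by whole-word matching against the tables."""
    words = re.findall(r"[a-z]+", virus_name.lower())
    for word in words:
        if word in HOST_KEYWORDS:
            host_class, host_family = HOST_KEYWORDS[word]
            return {"kind": "host", "class": host_class, "family": host_family}
    # No host word found: fall back to the sampling environment.
    for word in words:
        if word in ENVIRONMENT_KEYWORDS:
            return {"kind": "environment", "label": word}
    return {"kind": "unclassified"}


def bin_by_host(names_by_taxid):
    """Group taxids by predicted host Class, environment, or 'unclassified'."""
    bins = defaultdict(list)
    for taxid, name in names_by_taxid.items():
        result = classify(name)
        if result["kind"] == "host":
            key = result["class"]
        elif result["kind"] == "environment":
            key = "environment:" + result["label"]
        else:
            key = "unclassified"
        bins[key].append(taxid)
    return dict(bins)


# Placeholder taxids (1, 2, 3), not real NCBI identifiers.
example = {1: "Bat coronavirus", 2: "Porcine circovirus 2", 3: "Marine RNA virus"}
bins = bin_by_host(example)
```

Because the lookup depends only on words in the name, a newly named virus can be binned without any taxid-to-host database, at the cost of missing names that contain no recognized host or environment word.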

Results: Benchmarked on a test dataset of 1000 randomly selected viral taxids from the NCBI taxonomy database, VHost-Classifier assigned a host, with 100% accuracy, to the rank of Class for >93% of viruses and to the rank of Family for >37% of viruses.
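The two figures reported per rank are distinct quantities: the fraction of viruses that received any assignment at that rank (coverage), and the fraction of those assignments that were correct (accuracy). A minimal sketch of how they could be computed follows. The function name `rank_summary`, the data layout, and the toy values are assumptions for illustration and are not the benchmark data.

```python
def rank_summary(predictions, truth, rank):
    """Return (coverage, accuracy) for one taxonomic rank.

    coverage = viruses with a prediction at this rank / all viruses
    accuracy = correct predictions at this rank / predictions at this rank
    """
    assigned = [t for t, p in predictions.items() if p.get(rank) is not None]
    correct = [t for t in assigned if predictions[t][rank] == truth[t][rank]]
    coverage = len(assigned) / len(predictions) if predictions else 0.0
    accuracy = len(correct) / len(assigned) if assigned else 0.0
    return coverage, accuracy


# Toy data with placeholder taxids; None means no assignment at that rank.
predictions = {
    1: {"class": "Mammalia", "family": "Suidae"},
    2: {"class": "Mammalia", "family": None},
    3: {"class": "Aves", "family": None},
    4: {"class": None, "family": None},
}
truth = {
    1: {"class": "Mammalia", "family": "Suidae"},
    2: {"class": "Mammalia", "family": "Bovidae"},
    3: {"class": "Aves", "family": "Phasianidae"},
    4: {"class": "Insecta", "family": "Culicidae"},
}

class_coverage, class_accuracy = rank_summary(predictions, truth, "class")
family_coverage, family_accuracy = rank_summary(predictions, truth, "family")
```

In this toy set, coverage drops from Class to Family while accuracy stays at 1.0, which mirrors the shape of the reported result: fewer assignments at finer ranks, with no incorrect assignments among those made.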

Availability and implementation: For more information about VHost-Classifier as well as implementation instructions, visit https://github.com/Kzra/VHost-Classifier.

Supplementary information: Supplementary data are available at Bioinformatics online.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Databases, Genetic
  • Metagenome
  • Natural Language Processing*
  • Viruses*