Benchmarking Bioinformatic Tools for Amplicon-Based Sequencing of Norovirus

Appl Environ Microbiol. 2023 Jan 31;89(1):e0152222. doi: 10.1128/aem.01522-22. Epub 2022 Dec 21.

Abstract

To survey noroviruses in our environment, it is essential that both wet-lab and computational methods are fit for purpose. Using a simulated sequencing data set, denoising-based pipelines (DADA2, Deblur, and USEARCH-UNOISE3) and clustering-based pipelines (VSEARCH and FROGS) were compared with respect to their ability to represent composition and sequence information. Open source classifiers (Ribosomal Database Project [RDP], BLASTn, IDTAXA, QIIME2 naive Bayes, and SINTAX) were trained using three different databases: a custom database, the NoroNet database, and the Human calicivirus database. Each classifier-database combination was then compared on classification accuracy. VSEARCH provides a robust option for analyzing viral amplicons based on composition analysis; however, all pipelines could return OTUs with high similarity to the expected sequences. Importantly, pipeline choice could lead to more false positives (DADA2) or underclassification (FROGS), a key consideration when applying a pipeline to source attribution. Classification was more strongly impacted by the classifier than the database, although disagreement increased with norovirus GII.4 capsid variant designation. We recommend the use of the RDP classifier in conjunction with VSEARCH; however, maintenance of the underlying database is essential for optimal use.

IMPORTANCE: In benchmarking bioinformatic pipelines for analyzing high-throughput sequencing (HTS) data sets, we provide method standardization for bioinformatics broadly and specifically for norovirus, where no officially endorsed methods currently exist. This study provides recommendations for the appropriate analysis and classification of norovirus amplicon HTS data and will be widely applicable during outbreak investigations.
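
The comparison of classifier and database combinations described above can be illustrated with a minimal sketch. It is not the authors' code: the function name `score_classifications`, the record layout, and the toy genotype calls are all hypothetical and invented for illustration, and the study may have scored accuracy differently. The sketch assumes each simulated sequence has a known expected genotype, and it separates wrong calls from missing calls so that misclassification and underclassification can be told apart.

```python
from collections import defaultdict


def score_classifications(records):
    """Tally outcomes per (classifier, database) pair.

    records: iterable of (classifier, database, expected, predicted) tuples.
    predicted is None when no genotype-level call was returned
    (treated here as underclassification; an assumption of this sketch).
    """
    counts = defaultdict(
        lambda: {"correct": 0, "misclassified": 0, "unclassified": 0}
    )
    for classifier, database, expected, predicted in records:
        bucket = counts[(classifier, database)]
        if predicted is None:
            bucket["unclassified"] += 1
        elif predicted == expected:
            bucket["correct"] += 1
        else:
            bucket["misclassified"] += 1

    summary = {}
    for key, c in counts.items():
        total = sum(c.values())
        # Accuracy here is the fraction of sequences with the expected call.
        summary[key] = {**c, "accuracy": c["correct"] / total}
    return summary


# Invented toy records, NOT results from the study.
records = [
    ("RDP", "custom", "GII.4", "GII.4"),
    ("RDP", "custom", "GII.17", "GII.17"),
    ("RDP", "custom", "GI.3", None),
    ("SINTAX", "custom", "GII.4", "GII.4"),
    ("SINTAX", "custom", "GII.17", "GII.4"),
    ("SINTAX", "custom", "GI.3", None),
]

summary = score_classifications(records)
for (classifier, database), stats in sorted(summary.items()):
    print(classifier, database, stats)
```

Keeping "unclassified" separate from "misclassified" matters for source attribution, where a confident wrong genotype and a missing genotype have different consequences.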

Keywords: Caliciviridae; calicivirus; classification; clustering; denoising; environmental virology; high-throughput sequencing; in silico.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Bayes Theorem
  • Benchmarking
  • Computational Biology / methods
  • Databases, Factual
  • High-Throughput Nucleotide Sequencing / methods
  • Humans
  • Norovirus* / genetics