Misclassifications in human papillomavirus databases

Virology. 2021 Jun:558:57-66. doi: 10.1016/j.virol.2021.03.002. Epub 2021 Mar 11.

Abstract

We assessed the quality of human papillomavirus (HPV) sequences in GenBank by screening for chimeras, incorrectly assembled ("wrong-assembled") contigs and taxonomy errors. An open-source script (HPVChimera_Gb) compared 25 638 HPV-related nucleotide sequences in GenBank with the 221 numbered HPV types and another 220 complete HPV sequences. The comparison identified 110 sequences with taxonomy/naming errors (sequences reported as an HPV type other than the one they corresponded to) and 1318 possibly chimeric sequences. Manual analysis found plausible explanations for most of them (e.g. a sequence covering an integration site), but 114 sequences appeared to be chimeras (96 of the 114 were already flagged as "unverified" by GenBank) and 13 had taxonomy/naming errors. Comparison of all correct HPV sequences in GenBank suggested about 800 unique putative HPV types. Systematic and regular work towards eliminating chimeric sequences and taxonomy/naming errors could improve the quality and order of HPV research.
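The screening idea described in the abstract can be illustrated with a minimal sketch. This is not the HPVChimera_Gb implementation, whose internals the abstract does not describe. It assumes a simplified approach: split a query into windows, assign each window to the reference type that shares the most k-mers with it, and flag the query when windows map to different types (possible chimera) or when none map to the declared type (possible naming error). The function names, the k-mer similarity measure and the thresholds are all hypothetical choices made for illustration.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(references, k=8):
    """Map each reference name to its k-mer set (hypothetical helper)."""
    return {name: kmers(seq, k) for name, seq in references.items()}


def best_reference(segment, index, k=8):
    """Return (name, score) of the reference sharing the largest
    fraction of the segment's k-mers; (None, 0.0) if nothing is shared."""
    seg = kmers(segment, k)
    best_name, best_score = None, 0.0
    if not seg:
        return best_name, best_score
    for name, ref_kmers in index.items():
        score = len(seg & ref_kmers) / len(seg)
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score


def screen(seq, declared_type, index, window=30, k=8, min_score=0.8):
    """Label non-overlapping windows of seq and flag suspicious records.

    Assumed thresholds: a window is only assigned to a type when at
    least min_score of its k-mers occur in that reference.
    """
    labels = []
    for start in range(0, len(seq), window):
        segment = seq[start:start + window]
        if len(segment) < k:
            continue  # trailing fragment too short to score
        name, score = best_reference(segment, index, k)
        labels.append(name if score >= min_score else None)
    assigned = {name for name in labels if name is not None}
    return {
        "labels": labels,
        # windows matching more than one type suggest a chimera
        "possible_chimera": len(assigned) > 1,
        # no window matches the type named in the record
        "naming_mismatch": bool(assigned) and declared_type not in assigned,
    }
```

A flag from a screen like this is only a candidate. As the abstract notes, most flagged sequences had plausible explanations on manual review, such as a sequence spanning a viral integration site, so a production workflow would likely rely on alignment-based similarity and manual curation rather than exact k-mer matching.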

Keywords: Chimera; HPVChimera; Human papillomavirus; International HPV Reference Center.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Amino Acid Sequence
  • Base Sequence
  • Classification
  • Databases, Nucleic Acid*
  • Humans
  • Papillomaviridae / classification*
  • Papillomaviridae / genetics*