Generating human papillomavirus (HPV) reference databases to maximize genomic mapping

Victor Trevino; Mariel Oyervides; Genaro A Ramírez-Correa; Lourdes Garza

doi:10.1007/s00705-021-05256-y

Arch Virol. 2022 Jan;167(1):57-65. doi: 10.1007/s00705-021-05256-y. Epub 2021 Oct 19.

Authors

Victor Trevino¹, Mariel Oyervides², Genaro A Ramírez-Correa³,⁴, Lourdes Garza⁵

Affiliations

¹ Tecnológico de Monterrey, Escuela de Medicina y Ciencias de la Salud, 64710, Monterrey, Nuevo León, Mexico. vtrevino@tec.mx.
² Tecnológico de Monterrey, Escuela de Ingeniería y Ciencias, 64849, Monterrey, Nuevo León, Mexico.
³ Department of Molecular Science, UT Health Rio Grande Valley, McAllen, TX, 78502, USA.
⁴ Division of Cardiology, Department of Pediatrics, Johns Hopkins University School of Medicine, Baltimore, MD, 21205, USA.
⁵ Centro Universitario Contra el Cáncer (CUCC), Servicio de Oncología, Universidad Autónoma de Nuevo León, Hospital Universitario "Dr. José Eleuterio González", 64460, Monterrey, Nuevo León, Mexico.

PMID: 34668074
DOI: 10.1007/s00705-021-05256-y

Abstract

Genomic experiments analyzing human papillomaviruses (HPVs) require a carefully selected list of sequences as a reference database to map millions of reads. The available sources, such as the Papillomavirus Episteme (PaVE), are organized based on variations in the L1 gene rather than the whole HPV sequence. Moreover, the PaVE process uses complex multiple sequence alignments containing hundreds or thousands of sequences. These issues complicate the generation of a reference database for genomics, leading to the generation of per-analysis-defined databases. Here, we propose a de novo strategy considering all HPV sequences reported in the NCBI database to define a subset of highly representative HPV sequences. The strategy is based on oligonucleotide frequency profiling of the whole sequence followed by hierarchical clustering. Using data from HPV capture experiments, we demonstrate that this strategy selects suitable sequences as a reference database to map most mappable reads unambiguously. We provide some recommendations to improve HPV mapping. The generated .fasta files can be accessed at https://github.com/vtrevino/HPV-Ref-Genomes .
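The strategy named in the abstract (oligonucleotide frequency profiling of the whole sequence, followed by hierarchical clustering) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the oligonucleotide length, the distance measure, the linkage rule, or how a representative is picked from each cluster. The sketch assumes k = 3, Euclidean distance, average linkage, and a medoid as the representative. All function names and the toy sequences are hypothetical.

```python
from itertools import product
from math import sqrt


def kmer_profile(seq, k=3):
    """Relative frequency of every A/C/G/T k-mer over the whole sequence."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # windows with ambiguous bases (e.g. N) are skipped
            counts[word] += 1
    total = sum(counts.values()) or 1  # avoid division by zero on short input
    return [counts[w] / total for w in kmers]


def euclidean(a, b):
    """Euclidean distance between two profiles (assumed metric)."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage_clusters(profiles, n_clusters):
    """Naive agglomerative clustering; returns clusters as lists of indices."""
    clusters = [[i] for i in range(len(profiles))]

    def dist(c1, c2):
        pairs = [(i, j) for i in c1 for j in c2]
        return sum(euclidean(profiles[i], profiles[j]) for i, j in pairs) / len(pairs)

    while len(clusters) > n_clusters:
        # Merge the two clusters with the smallest average pairwise distance.
        a, b = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


def pick_representatives(names, seqs, n_clusters, k=3):
    """Return one medoid sequence name per cluster (assumed selection rule)."""
    profiles = [kmer_profile(s, k) for s in seqs]
    reps = []
    for cluster in average_linkage_clusters(profiles, n_clusters):
        medoid = min(
            cluster,
            key=lambda i: sum(euclidean(profiles[i], profiles[j]) for j in cluster),
        )
        reps.append(names[medoid])
    return sorted(reps)


# Toy sequences, not real HPV genomes: two near-identical pairs.
names = ["toyA1", "toyA2", "toyB1", "toyB2"]
seqs = [
    "ACGTACGTACGTACGT",
    "ACGTACGTACGTACGA",
    "TTTTGGGGTTTTGGGG",
    "TTTTGGGGTTTTGGGC",
]
reps = pick_representatives(names, seqs, n_clusters=2)
print(reps)  # one name from the toyA pair and one from the toyB pair
```

Because the profile is computed over the whole sequence, it needs no multiple sequence alignment, which is the practical contrast the abstract draws with L1-based organization. For full-length genomes and thousands of sequences, the quadratic loops above would be too slow and a library implementation of hierarchical clustering would be the more realistic choice.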

MeSH terms

Alphapapillomavirus*
Chromosome Mapping
Genomics
Humans
Papillomaviridae / genetics
Papillomavirus Infections*
