ntHash2: recursive spaced seed hashing for nucleotide sequences

Parham Kazemi; Johnathan Wong; Vladimir Nikolić; Hamid Mohamadi; René L Warren; Inanç Birol

doi:10.1093/bioinformatics/btac564

Bioinformatics. 2022 Oct 14;38(20):4812-4813. doi: 10.1093/bioinformatics/btac564.

Authors

Parham Kazemi¹,², Johnathan Wong¹, Vladimir Nikolić¹, Hamid Mohamadi³, René L Warren¹, Inanç Birol¹,⁴

Affiliations

¹ Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4S6, Canada.
² Faculty of Science, University of British Columbia, Vancouver, BC V6T 1Z4, Canada.
³ Amazon Web Services Inc., Seattle, WA 98109, USA.
⁴ Department of Medical Genetics, University of British Columbia, Vancouver, BC V6T 1Z3, Canada.

Abstract

Motivation: Spaced seeds are robust alternatives to k-mers in analyzing nucleotide sequences with high base mismatch rates. Hashing is also crucial for efficiently storing abundant sequence data. Here, we introduce ntHash2, a fast algorithm for spaced seed hashing that can be integrated into various bioinformatics tools for efficient sequence analysis with applications in genome research.
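To make the idea of a spaced seed concrete, here is a minimal sketch of the naive approach: mask out the "don't care" positions of each k-mer, then hash what remains with a general-purpose hash function. This is not the ntHash2 algorithm or its API. The function names, the example seed pattern `1101011`, and the choice of BLAKE2b as the underlying hash are all assumptions made for illustration.

```python
import hashlib


def masked_kmer(kmer: str, seed: str) -> str:
    """Keep bases at '1' (care) positions of the seed; blank out '0' positions.

    The seed pattern and the '-' placeholder are illustrative assumptions.
    """
    if len(kmer) != len(seed):
        raise ValueError("k-mer and seed must have the same length")
    return "".join(base if s == "1" else "-" for base, s in zip(kmer, seed))


def naive_spaced_seed_hash(kmer: str, seed: str) -> int:
    """Hash the masked k-mer from scratch with a general-purpose hash (BLAKE2b)."""
    data = masked_kmer(kmer, seed).encode("ascii")
    digest = hashlib.blake2b(data, digest_size=8).digest()
    return int.from_bytes(digest, "big")


def spaced_seed_hashes(seq: str, seed: str) -> list[int]:
    """Hash every window of len(seed) bases in seq, recomputing each hash in full."""
    k = len(seed)
    return [naive_spaced_seed_hash(seq[i:i + k], seed) for i in range(len(seq) - k + 1)]
```

Under this sketch, two k-mers that differ only at a masked position, such as `ACGTACG` and `ACTTACG` with seed `1101011`, receive the same hash value, which is what makes spaced seeds tolerant of base mismatches. Every window is rehashed in full here; a recursive scheme would instead reuse work from the previous window.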

Results: ntHash2 is up to 2.1× faster at hashing various spaced seeds than the previous version and 3.8× faster than conventional hashing algorithms with naïve adaptation. Additionally, we reduced the collision rate of ntHash for longer k-mer lengths and improved the uniformity of the hash distribution by modifying the canonical hashing mechanism.
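For readers unfamiliar with canonical hashing, the sketch below illustrates the general idea only: a k-mer and its reverse complement should map to the same hash value, so that the result does not depend on which strand was sequenced. The abstract does not describe how ntHash2 modifies this mechanism, so the code makes no attempt to reproduce it. Taking the minimum of the two strand hashes is one common convention assumed here, and the function names and the use of BLAKE2b are likewise hypothetical.

```python
import hashlib

# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer: str) -> str:
    """Return the reverse complement of a DNA k-mer (uppercase ACGT only)."""
    return kmer.translate(_COMPLEMENT)[::-1]


def strand_hash(kmer: str) -> int:
    """Hash one strand with a general-purpose hash (BLAKE2b, an assumption)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def canonical_hash(kmer: str) -> int:
    """Strand-independent hash: combine forward and reverse-complement hashes.

    min() is one common combiner, assumed here for illustration; it is not
    claimed to be the mechanism used by ntHash2.
    """
    return min(strand_hash(kmer), strand_hash(reverse_complement(kmer)))
```

With this sketch, `canonical_hash("ACGTT")` and `canonical_hash("AACGT")` return the same value, because each k-mer is the reverse complement of the other. The choice of combiner affects how uniformly the resulting values are distributed, which is the property the abstract reports improving.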

Availability and implementation: ntHash2 is freely available online at github.com/bcgsc/ntHash under an MIT license.

Supplementary information: Supplementary data are available at Bioinformatics online.

Publication types

Research Support, N.I.H., Extramural

MeSH terms

Algorithms*
Base Sequence
Seeds
Sequence Analysis, DNA
Software*