Global, highly specific and fast filtering of alignment seeds

BMC Bioinformatics. 2022 Jun 10;23(1):225. doi: 10.1186/s12859-022-04745-4.

Abstract

Background: An important initial phase of arguably most homology search and alignment methods such as required for genome alignments is seed finding. The seed finding step is crucial to curb the runtime as potential alignments are restricted to and anchored at the sequence position pairs that constitute the seed. To identify seeds, it is good practice to use sets of spaced seed patterns, a method that locally compares two sequences and requires exact matches at certain positions only.

Results: We introduce a new method for filtering alignment seeds that we call geometric hashing. Geometric hashing achieves a high specificity by combining non-local information from different seeds using a simple hash function that only requires a constant and small amount of additional time per spaced seed. Geometric hashing was tested on the task of finding homologous positions in the coding regions of human and mouse genome sequences. Thereby, the number of false positives was decreased about million-fold over sets of spaced seeds while maintaining a very high sensitivity.

Conclusions: An additional geometric hashing filtering phase could improve the run-time, accuracy or both of programs for various homology-search-and-align tasks.

Keywords: Alignment; Alignment anchor; Genome alignment; Geometric hashing; Spaced seeds.

MeSH terms

  • Algorithms*
  • Animals
  • Genome*
  • Mice
  • Sequence Alignment