Raptor: A fast and space-efficient pre-filter for querying very large collections of nucleotide sequences

Enrico Seiler; Svenja Mehringer; Mitra Darvish; Etienne Turc; Knut Reinert

doi:10.1016/j.isci.2021.102782

iScience. 2021 Jun 24;24(7):102782. doi: 10.1016/j.isci.2021.102782. eCollection 2021 Jul 23.

Authors

Enrico Seiler¹,², Svenja Mehringer¹, Mitra Darvish², Etienne Turc³, Knut Reinert¹

Affiliations

¹ Department of Mathematics and Computer Science, Freie Universität Berlin, Berlin, Germany.
² Efficient Algorithms for Omics Data, Max Planck Institute for Molecular Genetics, Berlin, Germany.
³ ENSTA, Paris, France.

Abstract

We present Raptor, a system for approximately searching many queries, such as next-generation sequencing reads or transcripts, in large collections of nucleotide sequences. Raptor uses winnowing minimizers to define a set of representative k-mers, an extension of the interleaved Bloom filter (IBF) as a set membership data structure, and probabilistic thresholding for minimizers. Our approach allows compression and partitioning of the IBF to enable the effective use of secondary memory. We evaluate the performance and limitations of the new features using simulated and real datasets. Our data structure can be used to accelerate various core bioinformatics applications. We demonstrate this by re-implementing the distributed read mapping tool DREAM-Yara.
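To make the two central ideas concrete, the following is a minimal Python sketch of winnowing minimizers combined with an interleaved Bloom filter. It is an illustration under stated assumptions, not Raptor's implementation: the names `kmer_hash`, `minimizers`, and `InterleavedBloomFilter`, the choice of BLAKE2 as hash function, the toy sequences, and the parameters `k`, `w`, `rows`, and `hashes` are all hypothetical, and the window size `w` is assumed to be measured in bases. The probabilistic thresholding, compression, and partitioning described in the abstract are not modeled; the sketch only counts, per bin, how many query minimizers are reported as present.

```python
import hashlib


def kmer_hash(kmer: str, seed: int = 0) -> int:
    """64-bit hash of a k-mer; the seed selects one of several hash functions."""
    digest = hashlib.blake2b(
        kmer.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minimizers(seq: str, k: int, w: int) -> set[str]:
    """Return the k-mer with the smallest hash from every window of w bases."""
    per_window = w - k + 1  # number of k-mers inside one window
    if per_window < 1:
        raise ValueError("window size w must be at least k")
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return {
        min(kmers[i:i + per_window], key=kmer_hash)
        for i in range(len(kmers) - per_window + 1)
    }


class InterleavedBloomFilter:
    """One Bloom filter per bin, stored interleaved.

    Row r holds bit r of every bin's filter, so a single row lookup
    answers the membership question for all bins at once.
    """

    def __init__(self, bins: int, rows: int, hashes: int = 2):
        self.bins = bins
        self.rows = [0] * rows  # each row is an integer used as a bit vector
        self.hashes = hashes

    def _row_indices(self, kmer: str) -> list[int]:
        return [kmer_hash(kmer, seed + 1) % len(self.rows)
                for seed in range(self.hashes)]

    def insert(self, kmer: str, bin_index: int) -> None:
        for r in self._row_indices(kmer):
            self.rows[r] |= 1 << bin_index

    def bins_containing(self, kmer: str) -> int:
        """Bit mask of bins that may contain the k-mer (false positives possible)."""
        mask = (1 << self.bins) - 1
        for r in self._row_indices(kmer):
            mask &= self.rows[r]
        return mask

    def count(self, kmers: set[str]) -> list[int]:
        """For each bin, count how many of the given k-mers are reported present."""
        counts = [0] * self.bins
        for kmer in kmers:
            mask = self.bins_containing(kmer)
            for b in range(self.bins):
                counts[b] += (mask >> b) & 1
        return counts


# Toy data: two bins, each holding one made-up reference sequence.
references = [
    "ACGTTGCATGCCGATAGCTAGGCTTAACGGATCCA",
    "TTGACCGGTATATCGCGAATTCCGGAAGTCTAGGA",
]
ibf = InterleavedBloomFilter(bins=2, rows=1024)
for bin_index, reference in enumerate(references):
    for minimizer in minimizers(reference, k=5, w=8):
        ibf.insert(minimizer, bin_index)

# An error-free read cut from the first reference.
read = references[0][5:30]
read_minimizers = minimizers(read, k=5, w=8)
counts = ibf.count(read_minimizers)
```

Because every window of the read is also a window of the first reference, all of the read's minimizers were inserted into bin 0, and a Bloom filter never reports a false negative, so `counts[0]` equals the number of read minimizers. Counts for other bins can be nonzero through false positives or shared k-mers, which is why a pre-filter compares each count against a threshold before passing a bin on to a downstream tool.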

Keywords: bioinformatics; genetics; high-performance computing in bioinformatics.