REINDEER: efficient indexing of k-mer presence and abundance in sequencing datasets

Camille Marchet; Zamin Iqbal; Daniel Gautheret; Mikaël Salson; Rayan Chikhi

doi:10.1093/bioinformatics/btaa487

Bioinformatics. 2020 Jul 1;36(Suppl_1):i177-i185. doi: 10.1093/bioinformatics/btaa487.

Authors

Camille Marchet¹, Zamin Iqbal², Daniel Gautheret³, Mikaël Salson¹, Rayan Chikhi⁴

Affiliations

¹ CNRS, UMR 9189 - CRIStAL, Université de Lille, F-59000 Lille, France.
² European Bioinformatics Institute, Cambridge CB10 1SD, UK.
³ CEA, CNRS, Institute for Integrative Biology of the Cell (I2BC), Université Paris-Saclay, Gif-sur-Yvette 91190, France.
⁴ Institut Pasteur, CNRS, C3BI - USR 3756, 75015 Paris, France.

Abstract

Motivation: In this work we present REINDEER, a novel computational method that performs indexing of sequences and records their abundances across a collection of datasets. To the best of our knowledge, other indexing methods have so far been unable to record abundances efficiently across large datasets.

Results: We used REINDEER to index the abundances of sequences within 2585 human RNA-seq experiments in 45 h using only 56 GB of RAM. This makes REINDEER the first method able to record abundances at the scale of ∼4 billion distinct k-mers across 2585 datasets. REINDEER also supports exact presence/absence queries of k-mers. Briefly, REINDEER constructs the compacted de Bruijn graph of each dataset, then conceptually merges those de Bruijn graphs into a single global one. Then, REINDEER constructs and indexes monotigs, which in a nutshell are groups of k-mers of similar abundances.
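To make the idea of recording k-mer abundances across a collection concrete, here is a minimal toy sketch in Python. It is not REINDEER's implementation. It builds no de Bruijn graph and no monotigs, and simply stores one count vector per distinct k-mer, sharing identical vectors between k-mers as a crude stand-in for grouping k-mers of similar abundance. All function names (canonical, count_kmers, build_index, query) are hypothetical and chosen for illustration only.

```python
from collections import Counter

_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    revcomp = "".join(_COMPLEMENT[base] for base in reversed(kmer))
    return min(kmer, revcomp)


def count_kmers(reads, k):
    """Count canonical k-mers over all reads of one dataset."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[canonical(read[i:i + k])] += 1
    return counts


def build_index(datasets, k):
    """Map each distinct k-mer to a count vector with one entry per dataset.

    Identical vectors are stored once and shared (a toy simplification,
    not the monotig construction described in the abstract).
    """
    per_dataset = [count_kmers(reads, k) for reads in datasets]
    all_kmers = set().union(*per_dataset)
    vector_ids = {}   # count vector -> position in `vectors`
    vectors = []      # distinct count vectors
    kmer_to_vec = {}  # k-mer -> position in `vectors`
    for kmer in sorted(all_kmers):
        vec = tuple(counts[kmer] for counts in per_dataset)
        if vec not in vector_ids:
            vector_ids[vec] = len(vectors)
            vectors.append(vec)
        kmer_to_vec[kmer] = vector_ids[vec]
    return kmer_to_vec, vectors


def query(index, sequence, k):
    """Return, for each k-mer of the query, its count vector or None if absent."""
    kmer_to_vec, vectors = index
    results = []
    for i in range(len(sequence) - k + 1):
        vec_id = kmer_to_vec.get(canonical(sequence[i:i + k]))
        results.append(vectors[vec_id] if vec_id is not None else None)
    return results


# Two tiny made-up "datasets" of reads, indexed with k = 3.
toy_index = build_index([["ACGTA", "ACGTA"], ["ACGTT"]], 3)
print(query(toy_index, "ACGT", 3))  # prints [(4, 2), (4, 2)]
```

A None entry in the result corresponds to an absent k-mer, so the same toy structure answers both presence/absence and abundance queries. A dictionary of this kind would not scale to billions of k-mers; it only illustrates what an abundance index returns.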

Availability and implementation: https://github.com/kamimrcht/REINDEER.

Supplementary information: Supplementary data are available at Bioinformatics online.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms
Humans
Sequence Analysis, DNA*
Sequence Analysis, RNA
Software*