On weighted k-mer dictionaries

Algorithms Mol Biol. 2023 Jun 17;18(1):3. doi: 10.1186/s13015-023-00226-2.

Abstract

We consider the problem of representing a set of k-mers and their abundance counts, or weights, in compressed space so that assessing membership and retrieving the weight of a k-mer is efficient. The representation is called a weighted dictionary of k-mers and finds application in numerous tasks in Bioinformatics that usually count k-mers as a pre-processing step. In fact, k-mer counting tools produce very large outputs that may result in a severe bottleneck for subsequent processing. In this work we extend the recently introduced SSHash dictionary (Pibiri in Bioinformatics 38:185-194, 2022) to also compactly store the weights of the k-mers. From a technical perspective, we exploit the order of the k-mers represented in SSHash to encode runs of weights, hence allowing much better compression than the empirical entropy of the weights. We study the problem of reducing the number of runs in the weights to improve compression even further and give an optimal algorithm for this problem. Lastly, we corroborate our findings with experiments on real-world datasets and comparisons with competitive alternatives. To date, SSHash is the only k-mer dictionary that is exact, weighted, associative, fast, and small.
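
The idea of encoding runs of weights can be illustrated with a minimal sketch. It assumes the weights are listed in the order in which the dictionary stores the k-mers, so that equal weights tend to be adjacent. The function names `encode_runs` and `weight_at` and the sample data are hypothetical, and this is plain run-length encoding with binary search, not the compressed encoding actually used in SSHash.

```python
from bisect import bisect_right
from itertools import groupby


def encode_runs(weights):
    """Collapse a weight sequence into parallel lists (run_starts, run_values)."""
    starts, values = [], []
    pos = 0
    for value, group in groupby(weights):
        starts.append(pos)                # position where this run begins
        values.append(value)              # weight shared by the whole run
        pos += sum(1 for _ in group)      # advance by the run length
    return starts, values


def weight_at(starts, values, i):
    """Return the weight of the k-mer stored at position i (0-based)."""
    # The run containing i is the last run whose start is <= i.
    return values[bisect_right(starts, i) - 1]


# Hypothetical weights of ten consecutive k-mers (illustrative data only).
weights = [1, 1, 1, 1, 2, 2, 1, 1, 1, 5]
starts, values = encode_runs(weights)
```

In this sketch, ten weights collapse into four runs, and a lookup costs one binary search over the run starts. Fewer runs mean less space, which is why reducing the number of runs matters for compression.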

Keywords: Compression; Graphs; Hashing; Path cover; k-mers.