PgRC: pseudogenome-based read compressor

Tomasz M Kowalski; Szymon Grabowski

doi:10.1093/bioinformatics/btz919

Bioinformatics. 2020 Apr 1;36(7):2082-2089. doi: 10.1093/bioinformatics/btz919.

Authors

Tomasz M Kowalski¹, Szymon Grabowski¹

Affiliation

¹ Institute of Applied Computer Science, Lodz University of Technology, Lodz 90-924, Poland.

PMID: 31893286
DOI: 10.1093/bioinformatics/btz919

Abstract

Motivation: The amount of sequencing data from high-throughput sequencing technologies grows at a pace exceeding the one predicted by Moore's law. One of the basic requirements is to efficiently store and transmit such huge collections of data. Despite significant interest in designing FASTQ compressors, they are still imperfect in terms of compression ratio or decompression resources.

Results: We present Pseudogenome-based Read Compressor (PgRC), an in-memory algorithm for compressing the DNA stream, based on the idea of building an approximation of the shortest common superstring over high-quality reads. Experiments show that PgRC wins in compression ratio over its main competitors, SPRING and Minicom, by up to 15% and 20% on average, respectively, while being comparably fast in decompression.
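
To illustrate the underlying idea only, the following is a minimal toy sketch of approximating a shortest common superstring (a "pseudogenome") over a handful of reads and then storing each read as a position in it. It is not PgRC's implementation: the quadratic greedy merge, the function names (overlap, greedy_superstring, encode, decode) and the (offset, length) representation are assumptions chosen for readability, and the sketch ignores the high-quality/low-quality read partitioning, mismatches and reverse complements.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(reads: list[str]) -> str:
    """Toy greedy approximation of the shortest common superstring.

    Hypothetical helper for illustration; cubic-time in the worst case,
    so it is only suitable for tiny inputs.
    """
    # Drop exact duplicates (keeping order) and reads contained in another read.
    pool = list(dict.fromkeys(reads))
    pool = [r for r in pool if not any(r != s and r in s for s in pool)]

    while len(pool) > 1:
        # Find the ordered pair (i, j) with the largest suffix/prefix overlap.
        best_k, best_i, best_j = -1, 0, 1
        for i, a in enumerate(pool):
            for j, b in enumerate(pool):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        # Merge the pair, writing the shared part only once.
        merged = pool[best_i] + pool[best_j][best_k:]
        pool = [r for idx, r in enumerate(pool) if idx not in (best_i, best_j)]
        pool.append(merged)

    return pool[0] if pool else ""


def encode(reads: list[str], pseudogenome: str) -> list[tuple[int, int]]:
    """Represent each read as (offset, length) within the pseudogenome."""
    return [(pseudogenome.find(r), len(r)) for r in reads]


def decode(positions: list[tuple[int, int]], pseudogenome: str) -> list[str]:
    """Recover the reads from their (offset, length) pairs."""
    return [pseudogenome[p:p + n] for p, n in positions]


if __name__ == "__main__":
    reads = ["ACGTAC", "GTACGG", "ACGGTT", "GTAC"]
    pg = greedy_superstring(reads)
    positions = encode(reads, pg)
    print(pg, positions)
    print(decode(positions, pg) == reads)
```

In this toy setting the gain comes from overlapping reads sharing space in the pseudogenome, so the superstring is shorter than the concatenation of the reads; a real compressor would additionally entropy-code the pseudogenome and the read positions.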

Availability and implementation: PgRC can be downloaded from https://github.com/kowallus/PgRC.

Supplementary information: Supplementary data are available at Bioinformatics online.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms
Data Compression*
Genomics
High-Throughput Nucleotide Sequencing
Sequence Analysis, DNA
Software*