LFQC: a lossless compression algorithm for FASTQ files

Marius Nicolae; Sudipta Pathak; Sanguthevar Rajasekaran

doi:10.1093/bioinformatics/btv384

Bioinformatics. 2015 Oct 15;31(20):3276-81. doi: 10.1093/bioinformatics/btv384. Epub 2015 Jun 20.

Authors

Marius Nicolae¹, Sudipta Pathak¹, Sanguthevar Rajasekaran¹

Affiliation

¹ Department of Computer Science and Engineering, University of Connecticut, Storrs, CT 06269-4155, USA.

Abstract

Motivation: Next-generation sequencing (NGS) technologies have revolutionized genomic research by reducing the cost of whole-genome sequencing. One of the biggest challenges posed by modern sequencing technology is the economical storage of NGS data. Storing raw data is infeasible because of its enormous size and high redundancy. In this article, we address the problem of storing and transmitting large FASTQ files using innovative compression techniques.

Results: We introduce a new lossless, non-reference-based FASTQ compression algorithm named Lossless FASTQ Compressor (LFQC). We have compared our algorithm with other state-of-the-art big data compression algorithms, namely gzip, bzip2, fastqz (Bonfield and Mahoney, 2013), fqzcomp (Bonfield and Mahoney, 2013), Quip (Jones et al., 2012) and DSRC2 (Roguski and Deorowicz, 2014). This comparison reveals that our algorithm achieves better compression ratios than these tools on LS454 and SOLiD datasets.
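
To illustrate the kind of measurement behind such a comparison, the following is a minimal sketch of how a compression ratio (original size divided by compressed size) could be computed for the two general-purpose baselines, gzip and bzip2, using Python's standard library. It is not the LFQC algorithm and not the authors' benchmark: the FASTQ record, the helper name compression_ratio and the repetition count are assumptions made for illustration, and repeating a single record yields far higher ratios than real sequencing data would.

```python
import bz2
import gzip


def compression_ratio(original: bytes, compressed: bytes) -> float:
    """Return original size divided by compressed size (higher is better)."""
    return len(original) / len(compressed)


# Hypothetical four-line FASTQ record: identifier, bases, separator, qualities.
record = (
    b"@read_1\n"
    b"GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT\n"
    b"+\n"
    b"!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65\n"
)

# Assumption: repeat the record to get a non-trivial input size.
# Real FASTQ files are far less repetitive than this toy input.
data = record * 1000

# Compress with the two general-purpose baselines and record each ratio.
ratios = {
    "gzip": compression_ratio(data, gzip.compress(data)),
    "bzip2": compression_ratio(data, bz2.compress(data)),
}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x")
```

Because both compressors are lossless, decompressing either output returns the original bytes exactly; the ratio alone is what distinguishes the tools in a comparison of this kind.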

Availability and implementation: The implementations are freely available for non-commercial purposes. They can be downloaded from http://engr.uconn.edu/rajasek/lfqc-v1.1.zip.

Contact: rajasek@engr.uconn.edu.

Publication types

Research Support, N.I.H., Extramural
Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

Algorithms*
Data Compression / methods*
Genomics
High-Throughput Nucleotide Sequencing / methods*
Information Storage and Retrieval
Sequence Analysis, DNA / methods*

Grants and funding

R01-LM010101/LM/NLM NIH HHS/United States