FQC: A novel approach for efficient compression, archival, and dissemination of fastq datasets

J Bioinform Comput Biol. 2015 Jun;13(3):1541003. doi: 10.1142/S0219720015410036. Epub 2015 Feb 8.

Abstract

Sequence data repositories archive and disseminate fastq data in compressed format. Although GZIP compresses fastq data less efficiently than the available specialized fastq compression algorithms, data repositories continue to prefer it because of its ease of deployment, high processing speed, and portability. This study presents FQC, a fastq compression method that, in addition to providing significantly higher compression gains than GZIP, incorporates features necessary for universal adoption by data repositories and end users. This study also proposes a novel archival strategy that allows sequence repositories to simultaneously store and disseminate lossless as well as (multiple) lossy variants of fastq files without requiring any additional storage. FQC implementations for Linux, Windows, and Mac (both 32-bit and 64-bit) are freely available to academic users for download at https://metagenomics.atc.tcs.com/compression/FQC.
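
The abstract does not describe how lossless and lossy variants can share storage, so the following is only a generic sketch of one way such a property can arise, not the FQC algorithm. It assumes Phred+33 quality strings, a hypothetical bin width `BIN`, and the hypothetical helpers `archive`, `lossy`, and `lossless`. Quality scores are split into a coarse layer and a residual layer. The coarse layer alone yields a lossy variant, and both layers together restore the original, so no second copy of the data is stored.

```python
# Hypothetical illustration only: this is NOT the published FQC method.
import zlib

BIN = 8       # assumed bin width for the lossy quality variant
OFFSET = 33   # assumed Phred+33 encoding


def archive(qual: str) -> tuple[bytes, bytes]:
    """Store a quality string as two independently compressed layers."""
    scores = [ord(c) - OFFSET for c in qual]
    coarse = bytes(s // BIN for s in scores)    # bin index per base
    residual = bytes(s % BIN for s in scores)   # position within the bin
    return zlib.compress(coarse), zlib.compress(residual)


def lossy(coarse_blob: bytes) -> str:
    """Rebuild a binned (lossy) quality string from the coarse layer only."""
    coarse = zlib.decompress(coarse_blob)
    return "".join(chr(c * BIN + OFFSET) for c in coarse)


def lossless(coarse_blob: bytes, residual_blob: bytes) -> str:
    """Rebuild the exact quality string from both layers."""
    coarse = zlib.decompress(coarse_blob)
    residual = zlib.decompress(residual_blob)
    return "".join(
        chr(c * BIN + r + OFFSET) for c, r in zip(coarse, residual)
    )
```

In this sketch a repository would serve the coarse layer to users who accept binned qualities and both layers to users who need the original file. A different bin width would give a different lossy variant.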

Keywords: Data compaction and compression; NGS data; algorithms for biological data management; sequencing data archival.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Data Compression / methods*
  • Data Curation / methods*
  • Molecular Sequence Data
  • Sequence Analysis, DNA / methods*
  • Software*