CROMqs: An infinitesimal successive refinement lossy compressor for the quality scores

J Bioinform Comput Biol. 2020 Dec;18(6):2050031. doi: 10.1142/S0219720020500316. Epub 2020 Sep 16.

Abstract

The amount of sequencing data is growing rapidly, driven by fast advances in sequencing technologies. Quality scores, which indicate the reliability of each called nucleotide, account for a significant portion of the sequencing data. In addition, quality scores are more challenging to compress than nucleotides, and they are often noisy. Hence, a natural way to further decrease the size of the sequencing data is to apply lossy compression to the quality scores. Lossy compression may reduce precision; however, it has been shown that, when operating at certain rates, lossy compression can achieve performance on variant calling similar to that achieved with the losslessly compressed data (i.e. the original data). We propose Coding with Random Orthogonal Matrices for quality scores (CROMqs), the first lossy compressor designed for quality scores that has the "infinitesimal successive refinability" property. With this property, the encoder needs to compress the data only once, at a high rate, while the decoder can decompress it iteratively. At each step, the decoder reconstructs the set of quality scores with lower distortion than at the previous step. This characteristic is particularly useful in sequencing data compression, since the encoder does not generally know the most appropriate compression rate, e.g. the rate that avoids degrading variant calling accuracy. CROMqs avoids the need to compress the data at multiple rates, which saves time. In addition to this property, we show that CROMqs obtains rate-distortion performance comparable to that of state-of-the-art lossy compressors. Moreover, we show that it achieves performance on variant calling comparable to that of the losslessly compressed data while reducing size by more than 50%.
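The successive refinement idea described in the abstract (encode once, decode any prefix of the stream, with distortion shrinking at each step) can be illustrated with a toy sketch. This is not the CROMqs algorithm: the function names (`random_orthogonal`, `encode`, `decode`, `mse`), the shared seed, and the choice to store one unquantized coefficient per step are all assumptions made for illustration. A real compressor would not spend a full floating-point value per step.

```python
import math
import random


def random_orthogonal(n, rng):
    """Return n orthonormal vectors (Gram-Schmidt applied to Gaussian draws)."""
    basis = []
    while len(basis) < n:
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        for b in basis:
            dot = sum(vi * bi for vi, bi in zip(v, b))
            v = [vi - dot * bi for vi, bi in zip(v, b)]
        norm = math.sqrt(sum(vi * vi for vi in v))
        if norm > 1e-9:  # skip the (very unlikely) degenerate draw
            basis.append([vi / norm for vi in v])
    return basis


def encode(scores, steps, seed=0):
    """Encode once: each step records the best basis vector and its coefficient."""
    rng = random.Random(seed)  # seed is assumed to be shared with the decoder
    residual = [float(s) for s in scores]
    stream = []
    for _ in range(steps):
        basis = random_orthogonal(len(residual), rng)
        proj = [sum(r * b for r, b in zip(residual, vec)) for vec in basis]
        idx = max(range(len(proj)), key=lambda i: abs(proj[i]))
        coef = proj[idx]  # illustrative simplification: stored unquantized
        stream.append((idx, coef))
        residual = [r - coef * b for r, b in zip(residual, basis[idx])]
    return stream


def decode(stream, n, steps, seed=0):
    """Decode only the first `steps` entries; more steps give lower distortion."""
    rng = random.Random(seed)  # regenerates the same matrices as the encoder
    recon = [0.0] * n
    for idx, coef in stream[:steps]:
        basis = random_orthogonal(n, rng)
        recon = [x + coef * b for x, b in zip(recon, basis[idx])]
    return recon


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Made-up quality scores, encoded once and decoded at several prefix lengths.
scores = [38, 40, 37, 12, 25, 40, 33, 8]
stream = encode(scores, steps=40)
for k in (0, 5, 20, 40):
    print(k, mse(scores, decode(stream, len(scores), k)))
```

Because each step removes the largest component of the residual along a fresh orthonormal basis, the squared error in this sketch drops at every step, so a decoder can stop as soon as the reconstruction is good enough for its purpose.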

Keywords: Rateless compression; sequencing data compression; variant calling.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms*
  • Chromosomes, Human, Pair 20 / genetics
  • Computational Biology
  • Computer Simulation
  • Data Compression / methods*
  • Data Compression / standards
  • Data Compression / statistics & numerical data
  • Databases, Genetic / statistics & numerical data
  • Fourier Analysis
  • High-Throughput Nucleotide Sequencing / methods*
  • High-Throughput Nucleotide Sequencing / standards
  • High-Throughput Nucleotide Sequencing / statistics & numerical data
  • Humans
  • Software