ACO: lossless quality score compression based on adaptive coding order

BMC Bioinformatics. 2022 Jun 7;23(1):219. doi: 10.1186/s12859-022-04712-z.

Abstract

Background: With the rapid development of high-throughput sequencing technology, the cost of whole-genome sequencing has dropped rapidly, leading to exponential growth in genome data. The difficulty of efficiently compressing the DNA data generated by large-scale genome projects has become an important constraint on the further development of the DNA sequencing industry. Although the compression of DNA bases has improved significantly in recent years, the compression of quality scores remains challenging.

Results: In this paper, by reinvestigating the inherent correlations between quality scores and the sequencing process, we propose a novel lossless quality score compressor based on adaptive coding order (ACO). The main objective of ACO is to traverse the quality scores adaptively along the most correlated trajectory according to the sequencing process. Combined with adaptive arithmetic coding and an improved in-context strategy, ACO achieves state-of-the-art quality score compression performance with moderate complexity on next-generation sequencing (NGS) data.
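
The abstract does not specify ACO's traversal order or context model, so the following is only an illustrative sketch of the general idea behind context-based adaptive coding, not the authors' implementation. It estimates the ideal code length (in bits) that an adaptive arithmetic coder would approach when each quality symbol is predicted from the preceding symbols. The function name `adaptive_cost_bits`, the Phred+33 alphabet of 94 printable characters, and the add-one count initialization are all assumptions made for this example.

```python
import math
from collections import defaultdict

# Assumption: quality symbols are Phred+33 characters in the range '!'..'~'.
ALPHABET_SIZE = 94


def adaptive_cost_bits(quals: str, order: int = 1) -> float:
    """Ideal code length, in bits, of `quals` under an adaptive context model.

    Each symbol is predicted from the `order` symbols before it. Counts start
    at 1 for every symbol (add-one smoothing) and are updated after each
    symbol, which is what a decoder could reproduce step by step.
    """
    counts = defaultdict(lambda: [1] * ALPHABET_SIZE)  # per-context symbol counts
    totals = defaultdict(lambda: ALPHABET_SIZE)        # per-context count sums
    bits = 0.0
    for i, ch in enumerate(quals):
        ctx = quals[max(0, i - order):i]  # empty string when order == 0
        sym = ord(ch) - 33
        # An arithmetic coder spends about -log2(p) bits on a symbol of probability p.
        bits -= math.log2(counts[ctx][sym] / totals[ctx])
        # Adapt the model only after the symbol has been "coded".
        counts[ctx][sym] += 1
        totals[ctx] += 1
    return bits


if __name__ == "__main__":
    # Hypothetical quality string with strong dependence between neighbours.
    sample = "IIIIIIIIFFFFFFFF###" * 200
    for k in (0, 1):
        print(f"order-{k} model: {adaptive_cost_bits(sample, k):.0f} bits")
```

On data where neighbouring quality scores are correlated, the order-1 model should need fewer bits than the order-0 model. That gap is the kind of correlation a coding order is meant to exploit: the more predictable each symbol is from what was coded just before it, the cheaper it is to encode.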

Conclusions: This performance makes ACO a candidate tool for quality score compression. ACO has been adopted by AVS (Audio Video coding Standard Workgroup of China) and is freely available at https://github.com/Yoniming/ACO.

Keywords: Adaptive coding order; High-throughput sequencing; Lossless compression; Quality score compression.

MeSH terms

  • Algorithms
  • DNA
  • Data Compression*
  • High-Throughput Nucleotide Sequencing
  • Sequence Analysis, DNA
  • Software*

Substances

  • DNA