SeqCompress: an algorithm for biological sequence compression

Genomics. 2014 Oct;104(4):225-8. doi: 10.1016/j.ygeno.2014.08.007. Epub 2014 Aug 27.

Abstract

The growth of Next Generation Sequencing technologies presents significant research challenges, specifically the design of bioinformatics tools that handle massive amounts of data efficiently. The cost of storing biological sequence data has become a noticeable proportion of the total cost of data generation and analysis. In particular, the DNA sequencing rate is increasing significantly faster than disk storage capacity, so the volume of sequence data may eventually exceed the available storage. It is therefore essential to develop algorithms that handle large data sets through better memory management. This article presents SeqCompress, a DNA sequence compression algorithm that copes with the space complexity of biological sequences. The algorithm is lossless and uses a statistical model together with arithmetic coding to compress DNA sequences. The proposed algorithm is compared with recent specialized compression tools for biological sequences. Experimental results show that the proposed algorithm achieves a better compression gain than the existing algorithms it was compared against.
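
The abstract does not describe the statistical model in detail, so the following is only an illustrative sketch of the general idea of pairing a statistical model with arithmetic coding. It is not the SeqCompress method. It assumes a simple adaptive order-k context model over the four bases A, C, G, T with add-one smoothing, and it computes the ideal code length (the sum of -log2 p over all symbols) that an arithmetic coder driven by this model would approach to within a few bits. The function names `estimated_bits` and `bits_per_base` and the `order` parameter are hypothetical and introduced here for illustration.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"  # assumption: input contains only these four bases


def estimated_bits(seq, order=2):
    """Ideal code length in bits of `seq` under an adaptive order-`order` model.

    Illustrative only: an arithmetic coder fed with the same probabilities
    would produce output close to this many bits.
    """
    index = {base: i for i, base in enumerate(ALPHABET)}
    # One count vector per context, initialised to 1 (add-one smoothing)
    # so that no symbol ever has probability zero.
    counts = defaultdict(lambda: [1] * len(ALPHABET))
    bits = 0.0
    for pos, base in enumerate(seq):
        # The context is the previous `order` bases (shorter at the start).
        context = seq[max(0, pos - order):pos]
        ctx_counts = counts[context]
        p = ctx_counts[index[base]] / sum(ctx_counts)
        bits += -math.log2(p)
        # Adaptive update: a decoder can mirror it after decoding each base.
        ctx_counts[index[base]] += 1
    return bits


def bits_per_base(seq, order=2):
    """Average ideal code length per base; a plain 2-bit encoding gives 2.0."""
    return estimated_bits(seq, order) / len(seq)


if __name__ == "__main__":
    repetitive = "ACGT" * 200
    print(f"{bits_per_base(repetitive):.3f} bits/base vs 2.000 for fixed 2-bit coding")
```

For highly repetitive input the modelled cost falls well below the 2 bits per base of a fixed-width encoding, which is the kind of gain a model-based lossless compressor aims for. Real sequence data would also need handling for symbols outside A, C, G, T, which this sketch omits.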

Keywords: Compression; DNA; Genome sequences; NGS technologies.

Publication types

  • Comparative Study

MeSH terms

  • Algorithms*
  • Data Compression / methods*
  • Sequence Analysis, DNA / methods*
  • Sequence Analysis, Protein / methods