BIND - an algorithm for loss-less compression of nucleotide sequence data

Tungadri Bose; Monzoorul Haque Mohammed; Anirban Dutta; Sharmila S Mande

doi:10.1007/s12038-012-9230-6

J Biosci. 2012 Sep;37(4):785-9. doi: 10.1007/s12038-012-9230-6.

Authors

Tungadri Bose¹, Monzoorul Haque Mohammed, Anirban Dutta, Sharmila S Mande

Affiliation

¹ Bio-Sciences R&D Division, TCS Innovation Labs, 54B Hadapsar Industrial Estate, Tata Consultancy Services Limited, Hadapsar, Pune 411 013, India.

PMID: 22922203
DOI: 10.1007/s12038-012-9230-6

Abstract

Recent advances in DNA sequencing technologies have enabled the current generation of life science researchers to probe deeper into the genomic blueprint. The amount of data generated by these technologies has been increasing exponentially over the last decade. Storage, archival and dissemination of such huge data sets require efficient solutions from both the hardware and the software perspective. The present paper describes BIND, an algorithm specialized for compressing nucleotide sequence data. By adopting a unique 'block-length' encoding for representing binary data (as a key step), BIND achieves significant compression gains compared with the widely used general-purpose compression algorithms (gzip, bzip2 and lzma). Moreover, in contrast to implementations of existing specialized genomic compression approaches, the implementation of BIND can handle non-ATGC and lowercase characters. This makes BIND a loss-less compression approach that is suitable for practical use. More importantly, validation results of BIND (with real-world data sets) indicate that reasonable compression and decompression speeds can be achieved with minimal processor/memory usage. BIND is available for download at http://metagenomics.atc.tcs.com/compression/BIND. No license is required for academic or non-profit use.
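The abstract names a 'block-length' encoding of binary data and loss-less handling of non-ATGC and lowercase characters, but it does not spell out either step. The Python sketch below is only an illustration of one way such a scheme could work and should not be read as the BIND implementation. Everything in it is an assumption: the 2-bit mapping in `CODE`, the split into two bit streams, the side list of exceptions, and all function names are hypothetical. A real compressor would also pack the resulting numbers and pass them to a general-purpose back end, which is omitted here.

```python
from itertools import groupby

# Assumed 2-bit code per base; the abstract does not specify any mapping.
CODE = {"A": (0, 0), "T": (0, 1), "G": (1, 0), "C": (1, 1)}
BASE = {bits: base for base, bits in CODE.items()}


def block_lengths(bits):
    """Describe a bit list as (first_bit, lengths of alternating runs)."""
    runs = [len(list(group)) for _, group in groupby(bits)]
    return (bits[0] if bits else 0), runs


def expand_blocks(first_bit, runs):
    """Invert block_lengths: rebuild the bit list from its run lengths."""
    bits, bit = [], first_bit
    for length in runs:
        bits.extend([bit] * length)
        bit ^= 1  # consecutive runs alternate between 0 and 1
    return bits


def encode(seq):
    """Hypothetical encoder: two block-length streams plus exceptions."""
    # Anything other than uppercase A/T/G/C is stored verbatim in a side
    # list as (position, character) and replaced by a placeholder "A".
    exceptions = [(i, ch) for i, ch in enumerate(seq) if ch not in CODE]
    clean = [ch if ch in CODE else "A" for ch in seq]
    high = [CODE[ch][0] for ch in clean]  # first bit of every base
    low = [CODE[ch][1] for ch in clean]   # second bit of every base
    return block_lengths(high), block_lengths(low), exceptions


def decode(encoded):
    """Invert encode and restore the original string exactly."""
    (h_first, h_runs), (l_first, l_runs), exceptions = encoded
    high = expand_blocks(h_first, h_runs)
    low = expand_blocks(l_first, l_runs)
    chars = [BASE[pair] for pair in zip(high, low)]
    for pos, ch in exceptions:
        chars[pos] = ch  # put back non-ATGC and lowercase characters
    return "".join(chars)
```

Under these assumptions, `decode(encode(s)) == s` holds for any input string, including sequences that contain `N` or lowercase bases, which is the round-trip property a loss-less scheme needs. Long runs of identical bits collapse into single integers, which is where a run-based representation could gain over storing each bit.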

MeSH terms

Algorithms*
Base Sequence
Computing Methodologies
Data Compression / methods*
Information Storage and Retrieval*
Sequence Analysis, DNA*
Software