Computation of rank and select functions on hierarchical binary string and its application to genome mapping problems for short-read DNA sequences

Kouichi Kimura; Yutaka Suzuki; Sumio Sugano; Asako Koike

doi:10.1089/cmb.2008.0146

J Comput Biol. 2009 Nov;16(11):1601-13. doi: 10.1089/cmb.2008.0146.

Authors

Kouichi Kimura¹, Yutaka Suzuki, Sumio Sugano, Asako Koike

Affiliation

¹ Central Research Laboratory, Hitachi, Ltd., Tokyo, Japan. kouichi.kimura.hh@hitachi.com

PMID: 19772398
DOI: 10.1089/cmb.2008.0146

Abstract

We have developed efficient in-practice algorithms for computing rank and select functions on a binary string, based on a novel data structure, a hierarchical binary string with hierarchical accumulatives. It efficiently stores decomposed information on partial summations over various scales of subregions of a given binary string, so that the required space overhead ratio is only about 3.5% irrespective of the string length. Values of rank and select functions are computed hierarchically in [(log₂ n)/8] iterations, where n is the string length. For example, for an unbiased random binary string of 64 G bits, each value of these functions can be computed in about a microsecond, on average, on a single 3.0-GHz CPU using 8+ GB of memory. We also present their applications to genome mapping problems for large-scale short-read DNA sequence data, especially produced by ultra-high-throughput new-generation DNA sequencers. The algorithms are applied to the binarization of the Burrows-Wheeler transform of the human genome DNA sequence. For the sake of high-speed performance, we adopted a somewhat stringent mapping condition that allows at most a single-base mismatch (either a substitution, insertion, or deletion of a single base) per query sequence. An experimentally implemented program mapped several thousands of sequences per second on a single 3.0-GHz CPU, several times faster than ELAND, a widely used mapping program with the Illumina-Solexa 1G analyser.
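
For readers unfamiliar with the two functions named in the abstract, the following is a minimal Python sketch of rank and select over a binary string, using a single level of precomputed partial sums. It is not the authors' hierarchical structure: the class name `RankSelect`, the block size `BLOCK`, and the one-level layout are all assumptions chosen for illustration. It only shows the general idea of answering a query from stored partial summations plus a short local scan.

```python
from bisect import bisect_left

BLOCK = 256  # bits per block; an assumed value, not taken from the paper


class RankSelect:
    """Illustrative one-level rank/select directory (hypothetical, simplified)."""

    def __init__(self, bits):
        self.bits = list(bits)
        # cum[k] = number of 1s in bits[0 : k * BLOCK]
        self.cum = [0]
        for start in range(0, len(self.bits), BLOCK):
            self.cum.append(self.cum[-1] + sum(self.bits[start:start + BLOCK]))

    def rank1(self, i):
        """Return the number of 1s in bits[0:i]."""
        k = i // BLOCK
        # Stored partial sum up to the block boundary, plus a scan inside the block.
        return self.cum[k] + sum(self.bits[k * BLOCK:i])

    def select1(self, j):
        """Return the 0-based position of the j-th 1 (j >= 1), or -1 if absent."""
        if j < 1 or j > self.cum[-1]:
            return -1
        # Binary search for the block that contains the j-th 1.
        k = bisect_left(self.cum, j) - 1
        remaining = j - self.cum[k]
        for pos in range(k * BLOCK, min((k + 1) * BLOCK, len(self.bits))):
            remaining -= self.bits[pos]
            if remaining == 0:
                return pos
        return -1
```

In this sketch `rank1(i)` counts the 1s strictly before position `i`, and `select1(j)` is its inverse in the sense that `rank1(select1(j) + 1) == j` whenever the j-th 1 exists. A structure like the one the abstract describes would stack several such levels of partial sums at different scales rather than relying on one level and a linear scan.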

MeSH terms

Algorithms
Base Sequence
Chromosome Mapping / methods*
Computational Biology / methods*
Genome, Human / genetics*
Humans
Oligonucleotide Array Sequence Analysis
Time Factors