An efficient algorithm for finding short approximate non-tandem repeats

E F Adebiyi; T Jiang; M Kaufmann

doi:10.1093/bioinformatics/17.suppl_1.s5

Bioinformatics. 2001;17 Suppl 1:S5-S12. doi: 10.1093/bioinformatics/17.suppl_1.s5.

Authors

E F Adebiyi¹, T Jiang, M Kaufmann

Affiliation

¹ Wilhelm-Schickard-Institut für Informatik, Universität Tübingen, Sand 13, Tübingen, 72076, Germany. adebiyi@informatik.uni-tuebingen.de

PMID: 11472987
DOI: 10.1093/bioinformatics/17.suppl_1.s5

Abstract

We study the problem of approximate non-tandem repeat extraction. Given a long subject string S of length N over a finite alphabet Sigma and a threshold D, we would like to find all short substrings of S of length P that repeat with at most D differences, i.e., insertions, deletions, and mismatches. We give a careful theoretical characterization of the set of seeds (i.e., some maximal exact repeats) required by the algorithm, and prove a sublinear bound on their expected numbers. Using this result, we present a sub-quadratic algorithm for finding all short (i.e., of length O(log N)) approximate repeats. The running time of our algorithm is O(D N^(3 pow(epsilon) - 1) log N), where epsilon = D/P and pow(epsilon) is an increasing, concave function that is 0 when epsilon = 0 and about 0.9 for DNA and protein sequences.
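
To make the problem statement concrete, the following Python sketch enumerates approximate repeats by brute force. It is not the seed-based algorithm of the paper; it is only a quadratic-time baseline of the kind the paper aims to improve on. The function names, the choice to allow the second copy any length from P - D to P + D, and the requirement that the second copy start at or after the end of the first are illustrative assumptions, not details taken from the abstract.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance: insertions, deletions, mismatches cost 1 each."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # match or mismatch
        prev = curr
    return prev[-1]


def naive_approximate_repeats(s: str, p: int, d: int):
    """Brute-force baseline (hypothetical helper, not the paper's method).

    Returns triples (i, j, dist): the length-p substring starting at i
    recurs starting at j with dist <= d differences. Assumption: the second
    copy starts at j >= i + p, so the two copies never overlap.
    """
    n = len(s)
    hits = []
    for i in range(n - p + 1):
        window = s[i:i + p]
        for j in range(i + p, n):
            # With indels allowed, the second copy may be shorter or longer.
            candidates = [edit_distance(window, s[j:j + m])
                          for m in range(max(1, p - d), p + d + 1)
                          if j + m <= n]
            if candidates and min(candidates) <= d:
                hits.append((i, j, min(candidates)))
    return hits


if __name__ == "__main__":
    # "GATTACA" recurs as "GATCACA" (one mismatch) after a spacer.
    print(naive_approximate_repeats("GATTACACCCCCGATCACA", 7, 1))
```

Each pair of start positions costs one small dynamic program per candidate length, so the total work grows at least quadratically in N. That is the cost the seed characterization and the sub-quadratic bound in the abstract are meant to avoid.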

Publication types

Research Support, Non-U.S. Gov't
Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

Algorithms*
Computational Biology
Databases, Genetic
Repetitive Sequences, Nucleic Acid*