Fast sequence clustering using a suffix array algorithm

Ketil Malde; Eivind Coward; Inge Jonassen

doi:10.1093/bioinformatics/btg138

Bioinformatics. 2003 Jul 1;19(10):1221-6. doi: 10.1093/bioinformatics/btg138.

Authors

Ketil Malde¹, Eivind Coward, Inge Jonassen

Affiliation

¹ Department of Informatics, University of Bergen, HIB, N5020 Norway. ketil@ii.uib.no

PMID: 12835265

Abstract

Motivation: Efficient clustering is important for handling the large volume of available expressed sequence tag (EST) sequences. Most contemporary methods rely on some form of all-against-all comparison, which results in quadratic time complexity. A different approach is needed to keep up with the rapid growth of EST data.

Results: A new, fast EST clustering algorithm is presented. Sub-quadratic time complexity is achieved by using an algorithm based on suffix arrays. A prototype implementation has been developed and run on a benchmark data set. The resulting clusterings are validated by comparison with clusterings produced by other methods, and the results are promising.
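To illustrate the general idea of suffix-array-based clustering, the following is a minimal sketch and not the authors' implementation. It assumes that two sequences belong to the same cluster when they share an exact substring of at least `min_match` characters; that criterion, the `$` separator, and all function names are assumptions made for illustration. The suffix array is built by naive sorting, which is simple but not sub-quadratic; a practical implementation would use a more efficient construction.

```python
def build_suffix_array(text):
    # Naive construction (an assumption for brevity): sort suffix start
    # positions by the suffix they begin. Not efficient for large inputs.
    return sorted(range(len(text)), key=lambda i: text[i:])


def shared_prefix_len(text, i, j, limit, sep):
    # Length of the common prefix of the suffixes at i and j, capped at
    # `limit` and never extending across a separator character.
    n = 0
    while (n < limit and i + n < len(text) and j + n < len(text)
           and text[i + n] == text[j + n] and text[i + n] != sep):
        n += 1
    return n


def cluster_sequences(seqs, min_match=8, sep="$"):
    # Concatenate all sequences, remembering which sequence owns each position.
    text = sep.join(seqs) + sep
    owner = []
    for idx, s in enumerate(seqs):
        owner.extend([idx] * (len(s) + 1))

    # Union-find over sequence indices.
    parent = list(range(len(seqs)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Suffixes sharing a long prefix are contiguous in the suffix array,
    # so comparing neighbours is enough to link every such group.
    sa = build_suffix_array(text)
    for a, b in zip(sa, sa[1:]):
        if owner[a] == owner[b]:
            continue
        if shared_prefix_len(text, a, b, min_match, sep) >= min_match:
            parent[find(owner[a])] = find(owner[b])

    groups = {}
    for idx in range(len(seqs)):
        groups.setdefault(find(idx), []).append(idx)
    return sorted(groups.values())


if __name__ == "__main__":
    example = ["ACGTACGTGG", "TTACGTACGTCC", "GGGGCCCCAAAA",
               "TTTTGGGGCCCCAA", "ATATATATAT"]
    print(cluster_sequences(example, min_match=8))
```

The sketch avoids an explicit all-against-all comparison because only neighbouring entries of the sorted suffix list are examined; the clusters are the connected components induced by those neighbour links.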

Availability: The source code for the prototype implementation is available under the GPL from http://www.ii.uib.no/~ketil/bio/.

Publication types

Comparative Study
Evaluation Study
Research Support, Non-U.S. Gov't
Validation Study

MeSH terms

Algorithms*
Cluster Analysis*
Expressed Sequence Tags*
Gene Expression Profiling / methods*
Pattern Recognition, Automated
Reproducibility of Results
Sensitivity and Specificity
Sequence Alignment / methods*
Sequence Analysis / methods*
Sequence Homology