Fast sequence clustering using a suffix array algorithm

Bioinformatics. 2003 Jul 1;19(10):1221-6. doi: 10.1093/bioinformatics/btg138.

Abstract

Motivation: Efficient clustering is important for handling the large amount of available EST sequences. Most contemporary methods are based on some kind of all-against-all comparison, resulting in a quadratic time complexity. A different approach is needed to keep up with the rapid growth of EST data.

Results: A new, fast EST clustering algorithm is presented. Sub-quadratic time complexity is achieved by using an algorithm based on suffix arrays. A prototype implementation has been developed and run on a benchmark data set. The produced clusterings are validated by comparing them to clusterings produced by other methods, and the results are quite promising.

Availability: The source code for the prototype implementation is available under a GPL license from http://www.ii.uib.no/~ketil/bio/.

Publication types

  • Comparative Study
  • Evaluation Study
  • Research Support, Non-U.S. Gov't
  • Validation Study

MeSH terms

  • Algorithms*
  • Cluster Analysis*
  • Expressed Sequence Tags*
  • Gene Expression Profiling / methods*
  • Pattern Recognition, Automated
  • Reproducibility of Results
  • Sensitivity and Specificity
  • Sequence Alignment / methods*
  • Sequence Analysis / methods*
  • Sequence Homology