Clustered sequence representation for fast homology search

Michael Cameron; Yaniv Bernstein; Hugh E Williams

doi:10.1089/cmb.2007.R005

J Comput Biol. 2007 Jun;14(5):594-614. doi: 10.1089/cmb.2007.R005.

Authors

Michael Cameron¹, Yaniv Bernstein, Hugh E Williams

Affiliation

¹ School of Computer Science and Information Technology, RMIT University, Melbourne, Australia. mcam@cs.rmit.edu.au

PMID: 17683263

Abstract

We present a novel approach to managing redundancy in sequence databanks such as GenBank. We store clusters of near-identical sequences as a representative union-sequence and a set of corresponding edits to that sequence. During search, the query is compared to only the union-sequences representing each cluster; cluster members are then only reconstructed and aligned if the union-sequence achieves a sufficiently high score. Using this approach with BLAST results in a 27% reduction in collection size and a corresponding 22% decrease in search time with no significant change in accuracy. We also describe our method for clustering that uses fingerprinting, an approach that has been successfully applied to collections of text and web documents in Information Retrieval. Our clustering approach is ten times faster on the GenBank nonredundant protein database than the fastest existing approach, CD-HIT. We have integrated our approach into FSA-BLAST, our new Open Source version of BLAST (available from http://www.fsa-blast.org/). As a result, FSA-BLAST is twice as fast as NCBI-BLAST with no significant change in accuracy.
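The cluster representation and two-stage search described in the abstract can be illustrated with a minimal sketch. It is not the FSA-BLAST implementation, and it rests on several simplifying assumptions that are not taken from the abstract: all members of a cluster have the same length, positions where members differ are stored in the union-sequence as a wildcard character `*`, each member's edits are a list of (position, residue) pairs, and `toy_score` is a stand-in ungapped match counter rather than BLAST scoring. The names `Cluster`, `build_cluster`, `reconstruct`, `toy_score`, and `search` are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass

WILDCARD = "*"  # assumed placeholder for positions where cluster members differ


@dataclass
class Cluster:
    union: str                                # representative union-sequence
    edits: dict[str, list[tuple[int, str]]]   # member name -> (position, residue) edits


def build_cluster(members: dict[str, str]) -> Cluster:
    """Collapse equal-length, near-identical sequences into a union-sequence plus edits."""
    seqs = list(members.values())
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("this sketch only handles equal-length members")
    # Keep a residue where all members agree, otherwise store a wildcard.
    union = "".join(
        column[0] if len(set(column)) == 1 else WILDCARD for column in zip(*seqs)
    )
    # Each member records its own residue at every wildcard position.
    edits = {
        name: [(i, seq[i]) for i, u in enumerate(union) if u == WILDCARD]
        for name, seq in members.items()
    }
    return Cluster(union, edits)


def reconstruct(cluster: Cluster, name: str) -> str:
    """Rebuild one member by applying its edits to the union-sequence."""
    chars = list(cluster.union)
    for pos, residue in cluster.edits[name]:
        chars[pos] = residue
    return "".join(chars)


def toy_score(query: str, subject: str) -> int:
    """Best ungapped match count over all offsets; a wildcard matches any residue."""
    best = 0
    for offset in range(len(subject) - len(query) + 1):
        window = subject[offset:offset + len(query)]
        matches = sum(1 for q, s in zip(query, window) if s == WILDCARD or q == s)
        best = max(best, matches)
    return best


def search(query: str, clusters: list[Cluster], threshold: int) -> list[tuple[str, int]]:
    """Score union-sequences first; reconstruct members only for promising clusters."""
    hits = []
    for cluster in clusters:
        if toy_score(query, cluster.union) < threshold:
            continue  # no member is reconstructed or aligned for this cluster
        for name in cluster.edits:
            score = toy_score(query, reconstruct(cluster, name))
            if score >= threshold:
                hits.append((name, score))
    return sorted(hits, key=lambda h: (-h[1], h[0]))
```

Because a wildcard matches any residue in this sketch, the union-sequence score is never lower than the score of any member, so skipping a cluster whose union-sequence falls below the threshold cannot discard a member that would have passed. Real scoring schemes need a more careful treatment of wildcards to keep that property.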

Publication types

Research Support, Non-U.S. Gov't
Review

MeSH terms

Amino Acid Sequence
Animals
Databases, Protein* / trends
Humans
Molecular Sequence Data
Sequence Alignment / methods*
Sequence Alignment / trends
Sequence Analysis, Protein / methods*
Sequence Analysis, Protein / trends
Sequence Homology, Amino Acid*