Faster sequence homology searches by clustering subsequences

Shuji Suzuki; Masanori Kakuta; Takashi Ishida; Yutaka Akiyama

doi:10.1093/bioinformatics/btu780

Bioinformatics. 2015 Apr 15;31(8):1183-90. doi: 10.1093/bioinformatics/btu780. Epub 2014 Nov 27.

Authors

Shuji Suzuki¹, Masanori Kakuta², Takashi Ishida², Yutaka Akiyama¹

Affiliations

¹ Graduate School of Information Science and Engineering, Tokyo Institute of Technology and Education Academy of Computational Life Sciences (ACLS), Tokyo Institute of Technology, Tokyo 152-8550, Japan.
² Graduate School of Information Science and Engineering, Tokyo Institute of Technology and Education Academy of Computational Life Sciences (ACLS), Tokyo Institute of Technology, Tokyo 152-8550, Japan.

Abstract

Motivation: Sequence homology searches are used in various fields. New sequencing technologies produce huge amounts of sequence data, which continuously increase the size of sequence databases. As a result, homology searches require large amounts of computational time, especially for metagenomic analysis.

Results: We developed a fast homology search method based on database subsequence clustering and implemented it as GHOSTZ. The method clusters similar subsequences from a database so that seed search and ungapped extension run efficiently, using the triangle inequality to reduce the number of alignment candidates. Database subsequence clustering yielded an ∼2-fold increase in speed without a large decrease in search sensitivity. On metagenomic data, GHOSTZ was ∼2.2-2.8 times faster than RAPSearch and ∼185-261 times faster than BLASTX.
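
Illustrative sketch (not the GHOSTZ implementation): the abstract does not specify the distance measure, clustering procedure, or thresholds, so the following toy example assumes fixed-length subsequences, Hamming distance, and a simple greedy clustering. The names `hamming`, `cluster_subsequences`, and `search` are hypothetical. It only shows how a triangle-inequality bound can rule out cluster members without comparing them to the query directly.

```python
from typing import Dict, List, Tuple

Clusters = Dict[str, List[Tuple[str, int]]]


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def cluster_subsequences(seeds: List[str], radius: int) -> Clusters:
    """Greedy clustering (an assumption for this sketch): each subsequence
    joins the first representative within `radius`, otherwise it becomes a
    new representative. The distance to the representative is stored."""
    clusters: Clusters = {}
    for s in seeds:
        for rep, members in clusters.items():
            d = hamming(rep, s)
            if d <= radius:
                members.append((s, d))
                break
        else:
            clusters[s] = [(s, 0)]
    return clusters


def search(query: str, clusters: Clusters, threshold: int) -> List[str]:
    """Return members within `threshold` of the query, pruning candidates
    with the bound d(q, m) >= |d(q, r) - d(r, m)|."""
    hits: List[str] = []
    for rep, members in clusters.items():
        d_qr = hamming(query, rep)  # one comparison per cluster
        for member, d_rm in members:
            if abs(d_qr - d_rm) > threshold:
                continue  # lower bound already exceeds the threshold
            if hamming(query, member) <= threshold:
                hits.append(member)
    return hits


seeds = ["ACDEF", "ACDEG", "WWWWW", "WWWWY"]
clusters = cluster_subsequences(seeds, radius=1)
hits = search("ACDEF", clusters, threshold=1)
```

In this toy input, both members of the second cluster are skipped after a single comparison with its representative, because the lower bound on their distance to the query is already above the threshold.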

Availability and implementation: The source code is freely available for download at http://www.bi.cs.titech.ac.jp/ghostz/

Contact: akiyama@cs.titech.ac.jp

Supplementary information: Supplementary data are available at Bioinformatics online.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms*
Amino Acid Sequence
Animals
Cluster Analysis
Databases, Genetic*
Humans
Metagenomics*
Military Personnel
Molecular Sequence Data
Programming Languages
Sequence Analysis, DNA / methods
Sequence Homology
Software*
Soil / chemistry

Substances

Soil