AlignBucket: a tool to speed up 'all-against-all' protein sequence alignments optimizing length constraints

Giuseppe Profiti; Piero Fariselli; Rita Casadio

doi:10.1093/bioinformatics/btv451

Bioinformatics. 2015 Dec 1;31(23):3841-3. doi: 10.1093/bioinformatics/btv451. Epub 2015 Jul 30.

Authors

Giuseppe Profiti¹, Piero Fariselli², Rita Casadio³

Affiliations

¹ Department of Computer Science and Engineering, via Mura Anteo Zamboni 7, Bologna, Bologna Biocomputing group, via S. Giacomo 9/2, Bologna and Health Sciences and Technologies ICIR, via Tolara di Sopra 41/E, Ozzano dell'Emilia, Italy.
² Department of Computer Science and Engineering, via Mura Anteo Zamboni 7, Bologna, Bologna Biocomputing group, via S. Giacomo 9/2, Bologna.
³ Bologna Biocomputing group, via S. Giacomo 9/2, Bologna and Health Sciences and Technologies ICIR, via Tolara di Sopra 41/E, Ozzano dell'Emilia, Italy.

PMID: 26231432

Abstract

Motivation: The next-generation sequencing era requires reliable, fast and efficient approaches for the accurate annotation of the ever-increasing number of biological sequences and their variations. Transfer of annotation upon similarity search is a standard approach. All-against-all protein comparison is a preliminary step in several available methods that annotate sequences based on information already present in databases. Given the current volume of sequences, methods that pre-process the data are needed to reduce sequence comparison time.

Results: We present an algorithm that optimizes the partition of a large volume of sequences (the whole database) into sets where sequence length values (in residues) are constrained depending on a bounded minimal and expected alignment coverage. The idea is to optimally group protein sequences according to their length, and then to compute the all-against-all sequence alignments among sequences that fall in a selected length range. We describe a mathematically optimal solution and we show that our method leads to a 5-fold speed-up in real-world cases.
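The reasoning behind a length constraint can be illustrated with a minimal sketch. It is not the AlignBucket partitioning algorithm itself, which the abstract describes only at a high level. The sketch assumes that coverage is measured against the longer sequence of a pair, so a pair can reach a minimal coverage of p percent only if the shorter length is at least p percent of the longer one. The function names and the integer percentage parameter are hypothetical and chosen for illustration.

```python
from bisect import bisect_right


def feasible_pairs(lengths, min_coverage_pct):
    """Count pairs whose lengths allow the requested minimal coverage.

    Assumption: a pair is feasible when
    shorter * 100 >= min_coverage_pct * longer.
    Integer arithmetic avoids floating-point edge cases at the boundary.
    """
    if not 0 < min_coverage_pct <= 100:
        raise ValueError("min_coverage_pct must be in (0, 100]")
    ordered = sorted(lengths)
    total = 0
    for i, shorter in enumerate(ordered):
        # Longest partner that still satisfies the coverage constraint.
        max_len = shorter * 100 // min_coverage_pct
        # Index one past the last sequence with length <= max_len.
        hi = bisect_right(ordered, max_len)
        # Count only partners placed after position i (each pair once).
        total += hi - i - 1
    return total


def all_pairs(n):
    """Number of unordered pairs in a plain all-against-all comparison."""
    return n * (n - 1) // 2


if __name__ == "__main__":
    toy_lengths = [100, 105, 110, 200, 210, 400]  # toy data, in residues
    print(feasible_pairs(toy_lengths, 90), "of", all_pairs(len(toy_lengths)))
```

On this toy input only the pairs with similar lengths remain candidates for alignment, which is the source of the saving. Any real speed-up depends on the length distribution of the database and on the chosen coverage.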

Availability and implementation: The software is available for download at http://www.biocomp.unibo.it/~giuseppe/partitioning.html.

Contact: giuseppe.profiti2@unibo.it.

Supplementary information: Supplementary data are available at Bioinformatics online.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms*
Computational Biology / methods*
Databases, Protein*
Humans
Proteins / chemistry*
Sequence Alignment / methods*
Software*

Substances

Proteins