Reducing Alignment Time Complexity of Ultra-Large Sets of Sequences

J Comput Biol. 2017 Nov;24(11):1144-1154. doi: 10.1089/cmb.2017.0097. Epub 2017 Jul 7.

Abstract

The alignment of three or more protein or nucleotide sequences is known as Multiple Sequence Alignment problem. The complexity of this problem increases exponentially with the number of sequences; therefore, many of the current approaches published in the literature suffer a computational overhead when thousands of sequences are required to be aligned. We introduce a new approach for dealing with ultra-large sets of sequences. A two-level clustering method is considered. The first level clusters the input sequences by using their biological composition, that is, the number of positive, negative, polar, special, and hydrophobic amino acids. In the second level, each cluster is divided into different clusters according to their similarity. Then, each cluster is aligned by using any method/aligner. After aligning the centroid sequences of each second-level cluster, we extrapolate the new gaps to each cluster of sequences to obtain the final alignment. We present a study on biological data with up to ∼100,000 sequences, showing that the proposed approach is able to obtain accurate alignments in a reduced amount of time; for example, in >10,000 sequences datasets, it is able to reduce up to ∼45 times the required runtime of the well-known Kalign.

Keywords: algorithms; amino acids; clustering; large sequences; multiple alignment.

MeSH terms

  • Algorithms*
  • Computational Biology / methods*
  • Humans
  • Sequence Alignment / methods*
  • Sequence Analysis, DNA / methods*
  • Sequence Analysis, Protein / methods*