ALFRED: A Practical Method for Alignment-Free Distance Computation

J Comput Biol. 2016 Jun;23(6):452-60. doi: 10.1089/cmb.2015.0217. Epub 2016 May 3.

Abstract

Alignment-free approaches are gaining persistent interest in many sequence analysis applications such as phylogenetic inference and metagenomic classification/clustering, especially for large-scale sequence datasets. Besides the widely used k-mer methods, the average common substring (ACS) approach has emerged to be one of the well-known alignment-free approaches. Two recent works further generalize this ACS approach by allowing a bounded number k of mismatches in the common substrings, relying on approximation (linear time) and exact computation, respectively. Albeit having a good worst-case time complexity [Formula: see text], the exact approach is complex and unlikely to be efficient in practice. Herein, we present ALFRED, an alignment-free distance computation method, which solves the generalized common substring search problem via exact computation. Compared to the theoretical approach, our algorithm is easier to implement and more practical to use, while still providing highly competitive theoretical performances with an expected run-time of [Formula: see text]. By applying our program to phylogenetic inference as a case study, we find that our program facilitates to exactly reconstruct the topology of the reference phylogenetic tree for a set of 27 primate mitochondrial genomes, at reasonably acceptable speed. ALFRED is implemented in C++ programming language and the source code is freely available online.

Keywords: approximate string matching; phylogenetic tree; suffix trees.

Publication types

  • Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

  • Algorithms
  • Animals
  • Computational Biology / methods*
  • Genome, Mitochondrial
  • Metagenomics
  • Phylogeny
  • Primates / genetics*
  • Sequence Alignment / methods*