Computing the distribution of the Robinson-Foulds distance

Comput Biol Chem. 2020 May 19:87:107284. doi: 10.1016/j.compbiolchem.2020.107284. Online ahead of print.

Abstract

With the exponential growth of genome databases, the importance of phylogenetics has increased dramatically over the past years. Studying phylogenetic trees enables us not only to understand how genes, genomes, and species evolve, but also helps us predict how they might change in future. One of the crucial aspects of phylogenetics is the comparison of two or more phylogenetic trees. There are different metrics for computing the dissimilarity between a pair of trees. The Robinson-Foulds (RF) distance is one of the widely used metrics on the space of labeled trees. The distribution of the RF distance from a given tree has been studied before, but the fastest known algorithm for computing this distribution is a slow, albeit polynomial-time, O(l5) algorithm. In this paper, we modify the dynamic programming algorithm for computing the distribution of this distance for a given tree by leveraging the number-theoretic transform (NTT), and improve the running time from O(l5) to O(l3logl), where l is the number of tips of the tree. In addition to its practical usefulness, our method represents a theoretical novelty, as it is, to our knowledge, one of the rare applications of the number-theoretic transform for solving a computational biology problem.

Keywords: Convolution; Null distribution; Number-theoretic transform; Phylogenetics; RF distance.