Minimum Spanning vs. Principal Trees for Structured Approximations of Multi-Dimensional Datasets

Alexander Chervov; Jonathan Bac; Andrei Zinovyev

doi:10.3390/e22111274

Entropy (Basel). 2020 Nov 11;22(11):1274. doi: 10.3390/e22111274.

Authors

Alexander Chervov^{1,2,3}, Jonathan Bac^{1,2,3,4}, Andrei Zinovyev^{1,2,3,5}

Affiliations

¹ Institut Curie, PSL Research University, F-75005 Paris, France.
² Institut national de la santé et de la recherche médicale, U900, F-75005 Paris, France.
³ CBIO-Centre for Computational Biology, Mines ParisTech, PSL Research University, 75006 Paris, France.
⁴ Centre de Recherches Interdisciplinaires, Université de Paris, F-75000 Paris, France.
⁵ Lobachevsky University, 603000 Nizhny Novgorod, Russia.

Abstract

Graph-based approximations of multi-dimensional data point clouds are widely used in a variety of areas. Notable applications of such approximators include cellular trajectory inference in single-cell data analysis, analysis of clinical trajectories from synchronic datasets, and skeletonization of images. Several methods have been proposed to construct such approximating graphs, some based on computing minimum spanning trees and some based on principal graphs, which generalize principal curves. In this article we propose a methodology to compare and benchmark these two graph-based data approximation approaches, as well as to define their hyperparameters. The main idea is to avoid comparing graphs directly: first, a clustering of the data point cloud is induced from each graph approximation; second, well-established methods are used to compare and score the data cloud partitionings induced by the graphs. In particular, mutual information-based approaches prove to be useful in this context. The induced clustering is obtained by decomposing a graph into non-branching segments and then assigning each data point to its nearest segment. Such a method allows efficient comparison of graph-based data approximations of arbitrary topology and complexity. The method is implemented in Python using the standard scikit-learn library, which provides high speed and efficiency. As a demonstration of the methodology, we analyse and compare graph-based data approximation methods using synthetic as well as real-life single-cell datasets.
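
The idea of scoring graphs through the clusterings they induce can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (tree_segments, assign_to_segments, normalized_mutual_info), the toy trees, and the toy points are all hypothetical, and the sketch uses only the Python standard library, whereas the article relies on scikit-learn. The score is a plain mutual information normalized by the mean of the two entropies; scikit-learn's normalized_mutual_info_score and adjusted_mutual_info_score would be natural substitutes.

```python
import math
from collections import Counter, defaultdict

def tree_segments(edges):
    """Split a tree into non-branching paths between nodes of degree != 2."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    ends = [n for n in adj if len(adj[n]) != 2]  # leaves and branch points
    segments, seen = [], set()
    for start in ends:
        for nxt in adj[start]:
            if (start, nxt) in seen:
                continue
            path, prev, cur = [start, nxt], start, nxt
            while len(adj[cur]) == 2:  # walk through degree-2 nodes
                a, b = adj[cur]
                prev, cur = cur, (b if a == prev else a)
                path.append(cur)
            seen.update({(start, nxt), (cur, prev)})  # block the reverse walk
            segments.append(path)
    return segments

def dist2_to_edge(p, a, b):
    """Squared distance from point p to the line segment a-b."""
    ab = [bi - ai for ai, bi in zip(a, b)]
    ap = [pi - ai for ai, pi in zip(a, p)]
    denom = sum(x * x for x in ab)
    t = 0.0 if denom == 0 else sum(x * y for x, y in zip(ab, ap)) / denom
    t = max(0.0, min(1.0, t))
    return sum((ai + t * d - pi) ** 2 for ai, d, pi in zip(a, ab, p))

def assign_to_segments(points, pos, segments):
    """Label each point with the index of its nearest segment."""
    labels = []
    for p in points:
        d = [min(dist2_to_edge(p, pos[u], pos[v]) for u, v in zip(s, s[1:]))
             for s in segments]
        labels.append(d.index(min(d)))
    return labels

def normalized_mutual_info(a, b):
    """Mutual information of two labelings divided by their mean entropy."""
    n = len(a)
    ca, cb, cab = Counter(a), Counter(b), Counter(zip(a, b))
    mi = sum(c / n * math.log(c * n / (ca[x] * cb[y]))
             for (x, y), c in cab.items())
    ha = -sum(c / n * math.log(c / n) for c in ca.values())
    hb = -sum(c / n * math.log(c / n) for c in cb.values())
    return 1.0 if ha == 0 and hb == 0 else mi / ((ha + hb) / 2)

# Toy data: two Y-shaped trees with different node counts (hypothetical).
pos_fine = {0: (0, 0), 1: (-1, 0), 2: (-2, 0), 3: (1, 1), 4: (2, 2),
            5: (1, -1), 6: (2, -2)}
edges_fine = [(0, 1), (1, 2), (0, 3), (3, 4), (0, 5), (5, 6)]
pos_coarse = {0: (0, 0), 1: (-2, 0), 2: (2, 2), 3: (2, -2)}
edges_coarse = [(0, 2), (0, 3), (0, 1)]
points = [(-1.5, 0.1), (-0.5, -0.1), (1.4, 1.6), (0.6, 0.5),
          (1.5, -1.4), (0.5, -0.6)]

segments_fine = tree_segments(edges_fine)
segments_coarse = tree_segments(edges_coarse)
labels_fine = assign_to_segments(points, pos_fine, segments_fine)
labels_coarse = assign_to_segments(points, pos_coarse, segments_coarse)
score = normalized_mutual_info(labels_fine, labels_coarse)
print(labels_fine, labels_coarse, score)
```

Here the two trees differ in the number of nodes and in segment numbering, yet they induce the same partition of the points, so the score is close to 1. This is the reason for comparing induced clusterings rather than the graphs themselves: the score is insensitive to node count and labeling and depends only on how the data are partitioned.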

Keywords: clustering; data analysis; graph theory; minimum spanning trees; principal trees; single-cell transcriptomics; trajectory inference.

Grants and funding

ANR-19-P3IA-0001/Agence Nationale de la Recherche