Determining clinically relevant features in cytometry data using persistent homology

Soham Mukherjee; Darren Wethington; Tamal K Dey; Jayajit Das

doi:10.1101/2021.04.26.441473

Determining clinically relevant features in cytometry data using persistent homology

bioRxiv [Preprint]. 2021 Apr 27:2021.04.26.441473. doi: 10.1101/2021.04.26.441473.

Authors

Soham Mukherjee, Darren Wethington, Tamal K Dey, Jayajit Das

Abstract

Cytometry experiments yield high-dimensional point cloud data that is difficult to interpret manually. Boolean gating techniques coupled with comparisons of relative abundances of cellular subsets is the current standard for cytometry data analysis. However, this approach is unable to capture more subtle topological features hidden in data, especially if those features are further masked by data transforms or significant batch effects or donor-to-donor variations in clinical data. We present that persistent homology, a mathematical structure that summarizes the topological features, can distinguish different sources of data, such as from groups of healthy donors or patients, effectively. Analysis of publicly available cytometry data describing non-naïve CD8+ T cells in COVID-19 patients and healthy controls shows that systematic structural differences exist between single cell protein expressions in COVID-19 patients and healthy controls. Our method identifies proteins of interest by a decision-tree based classifier and passes them to a kernel-density estimator (KDE) for sampling points from the density distribution. We then compute persistence diagrams from these sampled points. The resulting persistence diagrams identify regions in cytometry datasets of varying density and identify protruded structures such as 'elbows'. We compute Wasserstein distances between these persistence diagrams for random pairs of healthy controls and COVID-19 patients and find that systematic structural differences exist between COVID-19 patients and healthy controls in the expression data for T-bet, Eomes, and Ki-67. Further analysis shows that expression of T-bet and Eomes are significantly downregulated in COVID-19 patient non-naïve CD8+ T cells compared to healthy controls. This counter-intuitive finding may indicate that canonical effector CD8+ T cells are less prevalent in COVID-19 patients than healthy controls. This method is applicable to any cytometry dataset for discovering novel insights through topological data analysis which may be difficult to ascertain otherwise with a standard gating strategy or in the presence of large batch effects.

Author summary: Identifying differences between cytometry data seen as a point cloud can be complicated by random variations in data collection and data sources. We apply persistent homology used in topological data analysis to describe the shape and structure of the data representing immune cells in healthy donors and COVID-19 patients. By looking at how the shape and structure differ between healthy donors and COVID-19 patients, we are able to definitively conclude how these groups differ despite random variations in the data. Furthermore, these results are novel in their ability to capture shape and structure of cytometry data, something not described by other analyses.

Publication types

Preprint

Grants and funding

R01 AI146581/AI/NIAID NIH HHS/United States