K-nearest-neighbors induced topological PCA for single cell RNA-sequence data analysis

Sean Cottrell; Yuta Hozumi; Guo-Wei Wei

doi:10.1016/j.compbiomed.2024.108497

Comput Biol Med. 2024 Jun:175:108497. doi: 10.1016/j.compbiomed.2024.108497. Epub 2024 Apr 24.

Authors

Sean Cottrell¹, Yuta Hozumi¹, Guo-Wei Wei²

Affiliations

¹ Department of Mathematics, Michigan State University, East Lansing, MI 48824, USA.
² Department of Mathematics, Michigan State University, East Lansing, MI 48824, USA; Department of Electrical and Computer Engineering, Michigan State University, East Lansing, MI 48824, USA; Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI 48824, USA. Electronic address: weig@msu.edu.

PMID: 38678944
PMCID: PMC11090715 (available on 2024-06-01)
DOI: 10.1016/j.compbiomed.2024.108497

Abstract

Single-cell RNA sequencing (scRNA-seq) is widely used to reveal heterogeneity in cells, which has yielded insights into cell-cell communication, cell differentiation, and differential gene expression. However, analyzing scRNA-seq data is challenging because of its sparsity and the large number of genes involved. Dimensionality reduction and feature selection are therefore important for removing spurious signals and enhancing downstream analysis. Traditional principal component analysis (PCA), a workhorse of dimensionality reduction, cannot capture the geometric structure embedded in the data, and previous graph Laplacian regularizations are limited to analysis at a single scale. We propose a topological PCA (tPCA) method that combines the persistent Laplacian (PL) technique with L_{2,1} norm regularization to address multiscale and multiclass heterogeneity in data. We further introduce a k-nearest-neighbor (kNN) persistent Laplacian technique to improve the robustness of our persistent Laplacian method. The proposed kNN-PL is a new algebraic topology technique that addresses many limitations of traditional persistent homology. Rather than inducing a filtration by varying a distance threshold, we introduce kNN-tPCA, in which the filtration is obtained by varying the number of neighbors in a kNN network at each step, and we find that this framework has significant implications for hyperparameter tuning. We validate the efficacy of the proposed tPCA and kNN-tPCA methods on 11 diverse benchmark scRNA-seq datasets and show that they outperform other unsupervised PCA enhancements from the literature, as well as the popular Uniform Manifold Approximation and Projection (UMAP), t-Distributed Stochastic Neighbor Embedding (tSNE), and Projection Non-Negative Matrix Factorization (NMF) methods, by significant margins. For example, tPCA provides improvements of up to 628%, 78%, and 149% over UMAP, tSNE, and NMF, respectively, on classification in the F1 metric, and kNN-tPCA offers improvements of 53%, 63%, and 32% over UMAP, tSNE, and NMF, respectively, on clustering in the adjusted Rand index (ARI) metric.
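
The idea of a filtration induced by the number of neighbors, rather than by a distance threshold, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`knn_adjacency`, `graph_laplacian`, `knn_filtration_laplacian`, `l21_norm`), the 0/1 edge weights, and the weighted sum used to accumulate the Laplacians are all assumptions made for illustration, and the persistent Laplacian construction in the paper may differ. The sketch only shows that increasing k yields a nested family of graphs whose Laplacians can be combined into one multiscale regularizer, and how an L_{2,1} norm (the sum of the Euclidean norms of the rows of a matrix) is computed.

```python
import numpy as np


def knn_adjacency(X, k):
    """Symmetric 0/1 adjacency matrix of the kNN graph on the rows of X."""
    n = X.shape[0]
    # Pairwise squared Euclidean distances between rows (cells).
    sq = np.sum(X ** 2, axis=1)
    D = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(D, np.inf)  # a point is never its own neighbor
    # Indices of the k closest points for every row.
    idx = np.argsort(D, axis=1)[:, :k]
    A = np.zeros((n, n))
    A[np.repeat(np.arange(n), k), idx.ravel()] = 1.0
    # Keep an edge if either endpoint lists the other as a neighbor.
    return np.maximum(A, A.T)


def graph_laplacian(A):
    """Unnormalized graph Laplacian L = D - A."""
    return np.diag(A.sum(axis=1)) - A


def knn_filtration_laplacian(X, ks, weights=None):
    """Weighted sum of kNN graph Laplacians over increasing k.

    Assumption: a plain weighted sum stands in for the multiscale
    accumulation; the weights are hypothetical hyperparameters.
    """
    if weights is None:
        weights = np.ones(len(ks))
    return sum(w * graph_laplacian(knn_adjacency(X, k))
               for w, k in zip(weights, ks))


def l21_norm(M):
    """L_{2,1} norm: sum of the Euclidean norms of the rows of M."""
    return float(np.sum(np.linalg.norm(M, axis=1)))


# Toy usage: 30 "cells" with 5 "genes", filtration over k = 2, 4, 6.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 5))
L_demo = knn_filtration_laplacian(X_demo, ks=[2, 4, 6])
```

In this sketch the filtration parameter is an integer, so the scales to explore form a small discrete set instead of a continuous range of distance thresholds.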

Keywords: Clustering; Dimensionality reduction; Machine learning; Persistent Laplacian; Persistent homology; Topology; scRNA-seq.

MeSH terms

Algorithms
Humans
Principal Component Analysis*
RNA-Seq / methods
Sequence Analysis, RNA* / methods
Single-Cell Analysis* / methods
