Batch-Corrected Distance Mitigates Temporal and Spatial Variability for Clustering and Visualization of Single-Cell Gene Expression Data

Res Sq [Preprint]. 2023 Jul 26:rs.3.rs-3134332. doi: 10.21203/rs.3.rs-3134332/v1.

Abstract

Clustering and visualization are essential parts of single-cell gene expression data analysis. The Euclidean distance used in most distance-based methods is not optimal. The batch effect, i.e., the variability among samples collected at different times or from different tissues and patients, introduces large between-group distances and obscures the true identities of cells. To solve this problem, we introduce Batch-Corrected Distance (BCD), a metric that uses the temporal/spatial locality of the batch effect to control for such factors. We validate BCD on simulated data and apply it to a mouse retina development dataset and a lung dataset. We also demonstrate the utility of our approach in understanding the progression of Coronavirus Disease 2019 (COVID-19). BCD achieves more accurate clusters and better visualizations than state-of-the-art batch correction methods on longitudinal datasets. BCD can be directly integrated with most clustering and visualization methods to enable further scientific findings.

Publication types

  • Preprint