Identifying gene expression programs in single-cell RNA-seq data using linear correlation explanation

J Biomed Inform. 2024 Apr 15:104644. doi: 10.1016/j.jbi.2024.104644. Online ahead of print.

Abstract

Objective: Gene expression analysis through single-cell RNA sequencing (scRNA-seq) has revolutionized our understanding of gene regulation in diverse cell types, tissues, and organisms. While existing methods primarily focus on identifying cell type-specific gene expression programs (GEPs), the characterization of GEPs associated with biological processes and stimuli responses remains limited. In this study, we aim to infer biologically meaningful GEPs that are associated with both cellular phenotypes and activity programs directly from scRNA-seq data.

Methods: We applied linear CorEx, a machine-learning-based approach, to infer GEPs in simulated and real-world scRNA-seq datasets by grouping genes so as to optimize a total correlation objective. Additionally, we used a transfer learning approach to project CorEx-inferred GEPs onto other scRNA-seq datasets.
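
To make the quantity behind this objective concrete, the following is a minimal sketch of total correlation for a group of genes, not the linear CorEx algorithm itself. It assumes the expression values are approximately jointly Gaussian, in which case total correlation equals -1/2 times the log-determinant of the correlation matrix. The function name and the simulated data are hypothetical and do not come from the study.

```python
import numpy as np

def gaussian_total_correlation(X):
    """Total correlation (in nats) of the columns of X, assuming joint Gaussianity.

    Under that assumption, TC = -0.5 * log det(R), where R is the
    correlation matrix of the columns (here: genes).
    """
    R = np.corrcoef(X, rowvar=False)       # genes are columns, cells are rows
    _sign, logdet = np.linalg.slogdet(R)   # numerically stable log-determinant
    return -0.5 * logdet

# Hypothetical toy data: 2000 cells, 5 genes per group.
rng = np.random.default_rng(0)
n_cells = 2000
latent = rng.normal(size=(n_cells, 1))                       # one shared "program"
coexpressed = latent + 0.5 * rng.normal(size=(n_cells, 5))   # genes driven by it
independent = rng.normal(size=(n_cells, 5))                  # unrelated genes

tc_coexpressed = gaussian_total_correlation(coexpressed)
tc_independent = gaussian_total_correlation(independent)
print(f"co-expressed group TC: {tc_coexpressed:.3f} nats")
print(f"independent group TC:  {tc_independent:.3f} nats")
```

In this toy setting, genes driven by a shared latent factor have a much higher total correlation than unrelated genes, which is the kind of signal that makes a group of genes a candidate GEP.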

Results: On simulated data, linear CorEx, which groups genes by optimizing total correlation, outperformed similar methods in identifying cell types and activity programs. We then applied the same approach to real-world scRNA-seq data from the mouse dentate gyrus and embryonic colon development, uncovering biologically relevant GEPs related to cell types, developmental ages, and cell cycle programs. We also demonstrated the potential for transfer learning by evaluating similar datasets, showcasing the cross-species sensitivity of linear CorEx.
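
The projection of inferred GEPs onto another dataset can be illustrated with a short sketch. It assumes that each GEP is represented by a row of a gene-weight matrix and that projection amounts to applying those weights to mean-centered expression over the genes shared by both datasets. The function `project_programs`, the weight matrix, and the gene names are hypothetical, and a cross-species projection would also need an ortholog mapping step that is not shown here.

```python
import numpy as np

def project_programs(W, source_genes, X_target, target_genes):
    """Score target cells on programs learned elsewhere (illustrative only).

    W            : (n_programs, n_source_genes) hypothetical gene-weight matrix
    source_genes : gene names for the columns of W
    X_target     : (n_cells, n_target_genes) expression matrix
    target_genes : gene names for the columns of X_target
    """
    target_index = {gene: j for j, gene in enumerate(target_genes)}
    # Keep only genes present in both datasets, paired by name.
    shared = [(i, target_index[g]) for i, g in enumerate(source_genes)
              if g in target_index]
    if not shared:
        raise ValueError("source and target datasets share no genes")
    src_idx, tgt_idx = (list(t) for t in zip(*shared))
    X = X_target[:, tgt_idx]
    X = X - X.mean(axis=0)          # center each gene in the target data
    return X @ W[:, src_idx].T      # (n_cells, n_programs) program scores

# Hypothetical example: 2 programs over 4 source genes, 3 target cells.
W = np.array([[1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0]])
source_genes = ["geneA", "geneB", "geneC", "geneD"]
target_genes = ["geneC", "geneA", "geneX"]   # different order, partial overlap
X_target = np.array([[1.0, 2.0, 3.0],
                     [3.0, 4.0, 5.0],
                     [5.0, 6.0, 7.0]])

scores = project_programs(W, source_genes, X_target, target_genes)
print(scores)
```

Matching genes by name rather than by column position keeps the projection valid when the target dataset orders its genes differently or lacks some of them.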

Conclusion: Our findings validate linear CorEx as a valuable tool for comprehensively analyzing complex signals in scRNA-seq data, leading to deeper insights into gene expression dynamics, cellular heterogeneity, and regulatory mechanisms.

Keywords: Developmental biology; Machine learning; Single cell; Transfer learning; scRNA-seq.