scGCC: Graph Contrastive Clustering With Neighborhood Augmentations for scRNA-Seq Data Analysis

IEEE J Biomed Health Inform. 2023 Dec;27(12):6133-6143. doi: 10.1109/JBHI.2023.3319551. Epub 2023 Dec 5.

Abstract

Single-cell RNA sequencing (scRNA-seq) has rapidly emerged as a powerful technique for analyzing cellular heterogeneity at the individual cell level. In the analysis of scRNA-seq data, cell clustering is a critical step in downstream analysis, as it enables the identification of cell types and the discovery of novel cell subtypes. However, the characteristics of scRNA-seq data, such as high dimensionality and sparsity, dropout events and batch effects, present significant computational challenges for clustering analysis. In this study, we propose scGCC, a novel graph self-supervised contrastive learning model, to address the challenges faced in scRNA-seq data analysis. scGCC comprises two main components: a representation learning module and a clustering module. The scRNA-seq data is first fed into a representation learning module for training, which is then used for data classification through a clustering module. scGCC can learn low-dimensional denoised embeddings, which is advantageous for our clustering task. We introduce Graph Attention Networks (GAT) for cell representation learning, which enables better feature extraction and improved clustering accuracy. Additionally, we propose five data augmentation methods to improve clustering performance by increasing data diversity and reducing overfitting. These methods enhance the robustness of clustering results. Our experimental study on 14 real-world datasets has demonstrated that our model achieves extraordinary accuracy and robustness. We also perform downstream tasks, including batch effect removal, trajectory inference, and marker genes analysis, to verify the biological effectiveness of our model.

MeSH terms

  • Algorithms
  • Cluster Analysis
  • Data Analysis
  • Gene Expression Profiling / methods
  • Humans
  • Single-Cell Analysis* / methods
  • Single-Cell Gene Expression Analysis*