SMaSH: a scalable, general marker gene identification framework for single-cell RNA-sequencing

BMC Bioinformatics. 2022 Aug 8;23(1):328. doi: 10.1186/s12859-022-04860-2.

Abstract

Background: Single-cell RNA-sequencing is revolutionising the study of cellular and tissue-wide heterogeneity in a large number of biological scenarios, from highly tissue-specific studies of disease to human-wide cell atlases. A central task in single-cell RNA-sequencing analysis design is the calculation of cell type-specific genes in order to study the differential impact of different replicates (e.g. tumour vs. non-tumour environment) on the regulation of those genes and their associated networks. The crucial task is the efficient and reliable calculation of such cell type-specific 'marker' genes. These optimise the ability of the experiment to isolate highly specific cell phenotypes of interest to the analyser. However, while methods exist that can calculate marker genes from single-cell RNA-sequencing data, no such method places emphasis on specific cell phenotypes for downstream study in, e.g., differential gene expression or other experimental protocols (spatial transcriptomics protocols, for example). Here we present SMaSH, a general computational framework for extracting key marker genes from single-cell RNA-sequencing data which reliably characterise highly specific and niche populations of cells in numerous different biological data-sets.

Results: SMaSH extracts robust and biologically well-motivated marker genes, which characterise a given single-cell RNA-sequencing data-set better than existing computational approaches for general marker gene calculation. We demonstrate the utility of SMaSH through its substantial performance improvement over several existing methods in the field. Furthermore, we evaluate the SMaSH markers on spatial transcriptomics data, demonstrating that they identify highly localised compartments of the mouse cortex.

Conclusion: SMaSH is a new methodology for calculating robust marker genes from large single-cell RNA-sequencing data-sets, and has implications for, e.g., effective gene identification for probe design in downstream analyses such as spatial transcriptomics experiments. SMaSH has been fully integrated with the ScanPy framework and provides a valuable bioinformatics tool for cell type characterisation and validation in ever-growing data-sets spanning over 50 different cell types across hundreds of thousands of cells.

Keywords: Feature selection; Marker genes; Single-cell RNA-sequencing.

MeSH terms

  • Animals
  • Biomarkers
  • Computational Biology* / methods
  • Gene Expression Profiling / methods
  • Humans
  • Mice
  • RNA
  • Sequence Analysis, RNA
  • Single-Cell Analysis / methods
  • Transcriptome*

Substances

  • Biomarkers
  • RNA