Cell Type Annotation Model Selection: General-Purpose vs. Pattern-Aware Feature Gene Selection in Single-Cell RNA-Seq Data

Genes (Basel). 2023 Feb 26;14(3):596. doi: 10.3390/genes14030596.

Abstract

With the advances in high-throughput sequencing technology, an increasing amount of research in revealing heterogeneity among cells has been widely performed. Differences between individual cells' functionality are determined based on the differences in the gene expression profiles. Although the observations indicate a great performance of clustering methods, manual annotation of the clusters of cells is a challenge yet to be addressed more scalable and faster. On the other hand, due to the lack of enough labelled datasets, just a few supervised techniques have been used in cell type identification, and they obtained more robust results compared to clustering methods. A recent study showed that a complementary step of feature selection helped support vector machine (SVM) to outperform other classifiers in different scenarios. In this article, we compare and evaluate the performance of two state-of-the-art supervised methods, XGBoost and SVM, with information gain as a feature selection method. The results of the experiments on three standard scRNA-seq datasets indicate that XGBoost automatically annotates cell types in a simpler and more scalable framework. Additionally, it sheds light on the potential use of boosting tree approaches combined with deep neural networks to capture underlying information of single-cell RNA-Seq data more effectively. It can be used to identify marker genes and other applications in biological studies.

Keywords: cell type annotation; domain-specific features; feature selection; gradient boosting; scRNA-seq data.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • High-Throughput Nucleotide Sequencing / methods
  • Sequence Analysis, RNA / methods
  • Single-Cell Analysis* / methods
  • Single-Cell Gene Expression Analysis*
  • Transcriptome

Grants and funding

This research was partially funded by the Natural Sciences and Engineering Research Council of Canada, NSERC and the University of Windsor, Office of Research Services and Innovation.