scTransSort: Transformers for Intelligent Annotation of Cell Types by Gene Embeddings

Biomolecules. 2023 Mar 28;13(4):611. doi: 10.3390/biom13040611.

Abstract

Single-cell transcriptomics is rapidly advancing our understanding of the composition of complex tissues and biological cells, and single-cell RNA sequencing (scRNA-seq) holds great potential for identifying and characterizing the cell composition of complex tissues. Cell type identification from scRNA-seq data is still limited mostly by manual annotation, which is time-consuming and hard to reproduce. As scRNA-seq technology scales to thousands of cells per experiment, the exponential increase in the number of cell samples makes manual annotation even more difficult. In addition, the sparsity of gene transcriptome data remains a major challenge. This paper applies the idea of the transformer to single-cell classification tasks based on scRNA-seq data. We propose scTransSort, a cell-type annotation method pretrained on single-cell transcriptomics data. scTransSort represents genes as gene expression embedding blocks, which reduces the sparsity of the data used for cell type identification and lowers the computational complexity. Its distinguishing feature is intelligent information extraction from unordered data: it automatically extracts valid features of cell types without the need for manually labeled features or additional references. In experiments on cells from 35 human and 26 mouse tissues, scTransSort achieved high accuracy and performance in cell type identification and demonstrated strong robustness and generalization ability.
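
The idea of turning a sparse expression vector into gene expression embedding blocks can be illustrated with a minimal sketch. The abstract does not specify how blocks are formed or embedded, so everything below is an assumption made for illustration: the helper names `split_into_blocks` and `embed_blocks`, the block size of 64, the embedding width of 32, the zero padding, and the random projection that stands in for a learned embedding matrix. It is not the scTransSort implementation.

```python
import random

def split_into_blocks(expr, block_size):
    """Split one cell's expression vector into fixed-size gene blocks.

    Hypothetical helper: the vector is zero-padded so that its length
    becomes a multiple of block_size.
    """
    pad = (-len(expr)) % block_size
    padded = list(expr) + [0.0] * pad
    return [padded[i:i + block_size] for i in range(0, len(padded), block_size)]

def embed_blocks(blocks, weight):
    """Project each block to a dense token with a linear map.

    weight has block_size rows and embed_dim columns; zip(*weight)
    iterates over its columns.
    """
    return [
        [sum(x * w for x, w in zip(block, column)) for column in zip(*weight)]
        for block in blocks
    ]

rng = random.Random(0)
block_size, embed_dim = 64, 32  # assumed values, not taken from the paper

# Toy cell with 1000 genes, most of them unexpressed (zeros).
expr = [rng.randint(1, 20) if rng.random() < 0.1 else 0 for _ in range(1000)]

# Random weights stand in for an embedding matrix that would be learned.
weight = [[rng.gauss(0.0, 1.0) for _ in range(embed_dim)] for _ in range(block_size)]

blocks = split_into_blocks(expr, block_size)
tokens = embed_blocks(blocks, weight)
print(len(expr), "genes ->", len(tokens), "tokens of width", len(tokens[0]))
```

Under these assumptions a transformer would attend over 16 dense tokens instead of 1000 mostly-zero gene values, which is one way to read the claimed reduction in sparsity and computational cost.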

Keywords: annotation; cell type; classification; identity; scRNA-seq; transformer.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Animals
  • Gene Expression Profiling* / methods
  • Humans
  • Mice
  • Sequence Analysis, RNA / methods
  • Single-Cell Analysis* / methods
  • Transcriptome

Grants and funding

This work was supported by the National Key Research and Development Project of China (2021YFA1000102 and 2021YFA1000103), Natural Science Foundation of China (Grant Nos. 61873280, 61972416, 62272479, 62202498), Taishan Scholarship (tsqn201812029), Foundation of Science and Technology Development of Jinan (201907116), Shandong Provincial Natural Science Foundation (ZR2021QF023), Fundamental Research Funds for the Central Universities (21CX06018A), Spanish project PID2019-106960GB-I00, Juan de la Cierva IJC2018-038539-I.