Evaluation of machine learning approaches for cell-type identification from single-cell transcriptomics data

Brief Bioinform. 2021 Sep 2;22(5):bbab035. doi: 10.1093/bib/bbab035.

Abstract

Single-cell transcriptomics technologies have vast potential to advance our understanding of cellular heterogeneity in complex tissues. Although methods for interpreting single-cell transcriptomics data are developing rapidly, challenges remain in most analysis pipelines. The major limitation is a reliance on manual annotation for cell-type identification, which is time-consuming and irreproducible, and for which canonical markers are sometimes lacking for certain cell types. There is growing recognition that machine learning models, used as supervised classifiers, can significantly aid decision-making in cell-type identification. In this work, we performed a comprehensive and impartial evaluation of 10 machine learning models that automatically assign cell phenotypes. We estimated the performance of the classification methods on 20 publicly accessible single-cell RNA sequencing datasets that differ in size, technology, species and level of complexity. Each model was evaluated in within-dataset (intra-dataset) and across-dataset (inter-dataset) experiments in terms of both classification accuracy and computation time. In addition, we assessed sensitivity to the number of input features, to different annotation levels and to dataset complexity. The results showed that most classifiers perform well on a variety of datasets, with decreased accuracy on complex datasets, and that the Linear Support Vector Machine (linear-SVM) and Logistic Regression classifiers have the best overall performance with remarkably fast computation time. Our work provides a guideline for researchers selecting and applying suitable machine learning-based classification models in their analysis workflows, and it sheds light on potential directions for improving automated cell phenotype classification tools based on single-cell sequencing data.
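The within-dataset evaluation described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: it assumes scikit-learn, a synthetic matrix standing in for a cells-by-genes expression matrix, stratified 5-fold cross-validation, and plain accuracy as the metric. The fold count, the hyperparameters and the helper name `evaluate_intra_dataset` are illustrative assumptions, since the abstract does not specify them.

```python
import time

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import LinearSVC


def evaluate_intra_dataset(X, y, models, n_splits=5, seed=0):
    """Hypothetical helper: stratified k-fold accuracy and wall-clock time per model.

    Returns a dict mapping model name to (mean accuracy, total seconds).
    """
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    results = {}
    for name, make_model in models.items():
        fold_accuracies = []
        start = time.perf_counter()
        for train_idx, test_idx in cv.split(X, y):
            model = make_model()  # fresh, unfitted model for every fold
            model.fit(X[train_idx], y[train_idx])
            predictions = model.predict(X[test_idx])
            fold_accuracies.append(accuracy_score(y[test_idx], predictions))
        elapsed = time.perf_counter() - start
        results[name] = (float(np.mean(fold_accuracies)), elapsed)
    return results


# Synthetic stand-in for an expression matrix (rows = cells, columns = genes).
# Assumption: real data would be a normalized scRNA-seq matrix with curated labels.
X, y = make_classification(
    n_samples=600,
    n_features=200,
    n_informative=30,
    n_classes=4,
    n_clusters_per_class=1,
    class_sep=2.0,
    random_state=0,
)

# The two classifiers the abstract reports as best overall; settings are assumptions.
models = {
    "linear-SVM": lambda: LinearSVC(max_iter=5000),
    "logistic regression": lambda: LogisticRegression(max_iter=1000),
}

results = evaluate_intra_dataset(X, y, models)
for name, (accuracy, seconds) in results.items():
    print(f"{name}: mean accuracy={accuracy:.3f}, time={seconds:.2f}s")
```

An inter-dataset experiment would follow the same pattern but fit on one dataset and predict on another, after restricting both to a shared set of genes.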

Keywords: benchmarking; cell identity; classification; machine learning; single-cell RNA sequencing.

MeSH terms

  • Animals
  • Benchmarking
  • Brain / metabolism
  • Brain / pathology
  • Cells, Cultured
  • Datasets as Topic
  • Humans
  • Leukocytes, Mononuclear / cytology
  • Leukocytes, Mononuclear / metabolism
  • Logistic Models
  • Lymphocytes, Tumor-Infiltrating / metabolism
  • Lymphocytes, Tumor-Infiltrating / pathology
  • Mice
  • Pancreas / cytology
  • Pancreas / metabolism
  • Phenotype
  • Single-Cell Analysis / methods*
  • Support Vector Machine / classification*
  • Transcriptome*