Characterizing the impacts of dataset imbalance on single-cell data integration

Hassaan Maan; Lin Zhang; Chengxin Yu; Michael J Geuenich; Kieran R Campbell; Bo Wang

doi:10.1038/s41587-023-02097-9

Nat Biotechnol. 2024 Mar 1. doi: 10.1038/s41587-023-02097-9. Online ahead of print.

Authors

Hassaan Maan^{1,2,3}, Lin Zhang^{4,5}, Chengxin Yu^{6,7}, Michael J Geuenich^{6,7}, Kieran R Campbell^{8,9,10,11,12,13}, Bo Wang^{14,15,16,17,18}

Affiliations

¹ Peter Munk Cardiac Centre, University Health Network, Toronto, Ontario, Canada. hassaan.maan@mail.utoronto.ca.
² Vector Institute, Toronto, Ontario, Canada. hassaan.maan@mail.utoronto.ca.
³ Department of Medical Biophysics, University of Toronto, Toronto, Ontario, Canada. hassaan.maan@mail.utoronto.ca.
⁴ Peter Munk Cardiac Centre, University Health Network, Toronto, Ontario, Canada.
⁵ Department of Statistics and Actuarial Science, Simon Fraser University, Burnaby, British Columbia, Canada.
⁶ Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada.
⁷ Lunenfeld-Tanenbaum Research Institute, Toronto, Ontario, Canada.
⁸ Vector Institute, Toronto, Ontario, Canada. kierancampbell@lunenfeld.ca.
⁹ Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada. kierancampbell@lunenfeld.ca.
¹⁰ Lunenfeld-Tanenbaum Research Institute, Toronto, Ontario, Canada. kierancampbell@lunenfeld.ca.
¹¹ Department of Statistical Sciences, University of Toronto, Toronto, Ontario, Canada. kierancampbell@lunenfeld.ca.
¹² Department of Computer Science, University of Toronto, Toronto, Ontario, Canada. kierancampbell@lunenfeld.ca.
¹³ Ontario Institute for Cancer Research, Toronto, Ontario, Canada. kierancampbell@lunenfeld.ca.
¹⁴ Peter Munk Cardiac Centre, University Health Network, Toronto, Ontario, Canada. bo.wang@uhnresearch.ca.
¹⁵ Vector Institute, Toronto, Ontario, Canada. bo.wang@uhnresearch.ca.
¹⁶ Department of Medical Biophysics, University of Toronto, Toronto, Ontario, Canada. bo.wang@uhnresearch.ca.
¹⁷ Department of Computer Science, University of Toronto, Toronto, Ontario, Canada. bo.wang@uhnresearch.ca.
¹⁸ Department of Laboratory Medicine and Pathobiology, University of Toronto, Toronto, Ontario, Canada. bo.wang@uhnresearch.ca.

PMID: 38429430
DOI: 10.1038/s41587-023-02097-9

Abstract

Computational methods for integrating single-cell transcriptomic data from multiple samples and conditions do not generally account for imbalances in the cell types measured in different datasets. In this study, we examined how differences in the cell types present, the number of cells per cell type and the cell type proportions across samples affect downstream analyses after integration. The Iniquitate pipeline assesses the robustness of integration results after perturbing the degree of imbalance between datasets. Benchmarking of five state-of-the-art single-cell RNA sequencing integration techniques in 2,600 integration experiments indicates that sample imbalance has substantial impacts on downstream analyses and the biological interpretation of integration results. Imbalance perturbation led to statistically significant variation in unsupervised clustering, cell type classification, differential expression and marker gene annotation, query-to-reference mapping and trajectory inference. We quantified the impacts of imbalance through two newly introduced properties: aggregate cell type support and minimum cell type center distance. To better characterize and mitigate impacts of imbalance, we introduce balanced clustering metrics and imbalanced integration guidelines for integration method users.
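
The abstract does not spell out how the degree of imbalance is perturbed. As a minimal illustrative sketch, assuming that a perturbation means downsampling or entirely removing one cell type in one dataset before integration, the idea could look like the following. The function name `perturb_cell_type`, its parameters and the toy labels are hypothetical and are not taken from the Iniquitate pipeline.

```python
import random
from collections import Counter


def perturb_cell_type(cells, batch, cell_type, keep_fraction, seed=0):
    """Downsample one cell type within one batch (hypothetical helper).

    cells: list of (batch_label, cell_type_label) tuples, one per cell.
    keep_fraction: share of the targeted cells to keep; 0.0 removes
    the cell type from that batch altogether.
    """
    if not 0.0 <= keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be between 0 and 1")
    rng = random.Random(seed)  # fixed seed so the perturbation is reproducible
    # Indices of the cells that belong to the targeted batch and cell type.
    target = [i for i, (b, t) in enumerate(cells) if b == batch and t == cell_type]
    kept = set(rng.sample(target, round(len(target) * keep_fraction)))
    target_set = set(target)
    # Keep every untargeted cell, plus the sampled subset of targeted cells.
    return [c for i, c in enumerate(cells) if i not in target_set or i in kept]


# Toy balanced input: two batches, two cell types, 100 cells each.
cells = (
    [("batch1", "T cell")] * 100
    + [("batch1", "B cell")] * 100
    + [("batch2", "T cell")] * 100
    + [("batch2", "B cell")] * 100
)

downsampled = perturb_cell_type(cells, "batch2", "B cell", keep_fraction=0.1)
ablated = perturb_cell_type(cells, "batch2", "B cell", keep_fraction=0.0)

print(Counter(downsampled))
print(Counter(ablated))
```

Under this assumption, each perturbed dataset would then be passed to an integration method, and the downstream results compared against those from the unperturbed data.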
