ClusterDE: a post-clustering differential expression (DE) method robust to false-positive inflation caused by double dipping

Dongyuan Song; Kexin Li; Xinzhou Ge; Jingyi Jessica Li

doi:10.21203/rs.3.rs-3211191/v1

Res Sq [Preprint]. 2023 Aug 2:rs.3.rs-3211191. doi: 10.21203/rs.3.rs-3211191/v1.

Authors

Dongyuan Song¹, Kexin Li², Xinzhou Ge², Jingyi Jessica Li¹,²,³,⁴,⁵,⁶

Affiliations

¹ Bioinformatics Interdepartmental Ph.D. Program, University of California, Los Angeles, CA 90095-7246.
² Department of Statistics, University of California, Los Angeles, CA 90095-1554.
³ Department of Human Genetics, University of California, Los Angeles, CA 90095-7088.
⁴ Department of Computational Medicine, University of California, Los Angeles, CA 90095-1766.
⁵ Department of Biostatistics, University of California, Los Angeles, CA 90095-1772.
⁶ Radcliffe Institute for Advanced Study, Harvard University, Cambridge, MA 02138.

Abstract

In typical single-cell RNA-seq (scRNA-seq) data analysis, a clustering algorithm is applied to find putative cell types as clusters, and then a statistical differential expression (DE) test is employed to identify the DE genes between the cell clusters. However, this common procedure uses the same data twice, an issue known as "double dipping": once to define cell clusters as potential cell types, and again to identify DE genes as potential cell-type marker genes. This leads to false-positive cell-type marker genes even when the cell clusters are spurious. To overcome this challenge, we propose ClusterDE, a post-clustering DE method for controlling the false discovery rate (FDR) of identified DE genes regardless of clustering quality, which can work as an add-on to popular pipelines such as Seurat. The core idea of ClusterDE is to generate real-data-based synthetic null data containing only one cluster, as a contrast to the real data, for evaluating the whole procedure of clustering followed by a DE test. Using comprehensive simulations and real-data analysis, we show that ClusterDE has not only solid FDR control but also the ability to identify cell-type marker genes as top DE genes and distinguish them from housekeeping genes. ClusterDE is fast, transparent, and adaptive to a wide range of clustering algorithms and DE tests. Besides scRNA-seq data, ClusterDE is generally applicable to post-clustering DE analysis, including single-cell multi-omics data analysis.
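The double-dipping problem described above can be illustrated with a small simulation. The sketch below is not the ClusterDE procedure; it only shows the failure mode that motivates it. Everything in it is an assumption made for illustration: the toy Gaussian "expression" matrix, the clustering rule (sign of the first principal component), the Welch t-statistic with a normal approximation, and the helper names `pc1_clusters` and `de_pvalues`. All cells come from a single population, so every gene called DE is a false positive.

```python
import math
import numpy as np


def pc1_clusters(expr):
    """Toy clustering: split cells by the sign of their first-PC score."""
    centered = expr - expr.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[0] > 0


def de_pvalues(expr, labels):
    """Per-gene two-sided p-values from a Welch t-statistic.

    With thousands of cells, a normal approximation to the null
    distribution of the statistic is adequate for this illustration.
    """
    a, b = expr[labels], expr[~labels]
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    t = (a.mean(axis=0) - b.mean(axis=0)) / se
    return np.array([math.erfc(abs(x) / math.sqrt(2)) for x in t])


rng = np.random.default_rng(0)

# Hypothetical data: 2000 cells x 20 genes drawn from ONE population,
# so there are no true DE genes between any two groups of cells.
one_cluster = rng.normal(size=(2000, 20))

# Double dipping: cluster the data, then test the same data between clusters.
naive_p = de_pvalues(one_cluster, pc1_clusters(one_cluster))
naive_rate = float(np.mean(naive_p < 0.05))

# Reference: the same test with labels assigned independently of the data.
random_p = de_pvalues(one_cluster, rng.random(2000) < 0.5)
random_rate = float(np.mean(random_p < 0.05))

print(f"fraction of genes called DE after clustering: {naive_rate:.2f}")
print(f"fraction of genes called DE with random labels: {random_rate:.2f}")
```

Because the clusters are chosen to separate the cells, the test between them rejects far more often than its nominal 5% level, while the random-label reference stays close to it. Running the same clustering-plus-test pipeline on one-cluster null data is what gives a contrast against which real-data DE results can be judged.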

Publication types

Preprint
