Normalizing need not be the norm: count-based math for analyzing single-cell data

Samuel H Church; Jasmine L Mah; Günter Wagner; Casey W Dunn

doi:10.1007/s12064-023-00408-x

Theory Biosci. 2024 Feb;143(1):45-62. doi: 10.1007/s12064-023-00408-x. Epub 2023 Nov 10.

Authors

Samuel H Church¹, Jasmine L Mah², Günter Wagner² ³ ⁴ ⁵, Casey W Dunn²

Affiliations

¹ Department of Ecology and Evolutionary Biology, Yale University, New Haven, CT, USA. samuelhchurch@gmail.com.
² Department of Ecology and Evolutionary Biology, Yale University, New Haven, CT, USA.
³ Yale Systems Biology Institute, Yale University, New Haven, CT, USA.
⁴ Department of Obstetrics, Gynecology and Reproductive Sciences, Yale Medical School, New Haven, CT, USA.
⁵ Department of Obstetrics and Gynecology, Wayne State University, Detroit, MI, USA.

PMID: 37947999
DOI: 10.1007/s12064-023-00408-x

Abstract

Counting transcripts of mRNA is a key method of observation in modern biology. With advances in counting transcripts in single cells (single-cell RNA sequencing or scRNA-seq), these data are routinely used to identify cells by their transcriptional profile, and to identify genes with differential cellular expression. Because the total number of transcripts counted per cell can vary for technical reasons, the first step of many commonly used scRNA-seq workflows is to normalize by sequencing depth, transforming counts into proportional abundances. The primary objective of this step is to reshape the data such that cells with similar biological proportions of transcripts end up with similar transformed measurements. But there is growing concern that normalization and other transformations result in unintended distortions that hinder both analyses and the interpretation of results. This has led to an intense focus on optimizing methods for normalization and transformation of scRNA-seq data. Here, we take an alternative approach, by avoiding normalization and transformation altogether. We abandon the use of distances to compare cells, and instead use a restricted algebra, motivated by measurement theory and abstract algebra, that preserves the count nature of the data. We demonstrate that this restricted algebra is sufficient to draw meaningful and practical comparisons of gene expression through the use of the dot product and other elementary operations. This approach sidesteps many of the problems with common transformations, and has the added benefit of being simpler and more intuitive. We implement our approach in the package countland, available in Python and R.
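As an illustration of the count-based comparison described in the abstract, the sketch below computes pairwise dot products between cells directly on raw integer counts, with no normalization or transformation step. It is a minimal sketch and not the countland API: the function name pairwise_dot_products and the toy matrix are hypothetical, and only NumPy is assumed.

```python
import numpy as np


def pairwise_dot_products(counts):
    """Return the cells-by-cells matrix of dot products of raw counts.

    `counts` is a cells-by-genes matrix of integer transcript counts.
    Hypothetical helper for illustration; not part of countland.
    """
    counts = np.asarray(counts)
    # Refuse non-integer input so that the result stays a count.
    if not np.issubdtype(counts.dtype, np.integer):
        raise TypeError("expected raw integer counts")
    # Entry (i, j) sums, over genes, the product of the counts in cells i and j.
    return counts @ counts.T


# Hypothetical toy data: 3 cells (rows) by 4 genes (columns).
toy_counts = np.array([
    [3, 0, 1, 0],
    [2, 0, 2, 0],
    [0, 4, 0, 1],
])

dots = pairwise_dot_products(toy_counts)
print(dots)
```

In this toy example the first two cells share counts in the same genes and so have a positive dot product, while the third cell shares no expressed genes with either and has a dot product of zero with both. Every entry remains an integer, so the count nature of the data is preserved.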

MeSH terms

Gene Expression Profiling / methods
Sequence Analysis, RNA / methods
Single-Cell Analysis* / methods
Software*

Grants and funding

NSF 2109502/Directorate for Biological Sciences