Novel Data Transformations for RNA-seq Differential Expression Analysis

Zeyu Zhang; Danyang Yu; Minseok Seo; Craig P Hersh; Scott T Weiss; Weiliang Qiu

doi:10.1038/s41598-019-41315-w

Sci Rep. 2019 Mar 18;9(1):4820. doi: 10.1038/s41598-019-41315-w.

Authors

Zeyu Zhang¹, Danyang Yu², Minseok Seo³, Craig P Hersh³, Scott T Weiss³, Weiliang Qiu⁴

Affiliations

¹ Department of Bioinformatics, School of Life Sciences and Technology, Tongji University, Shanghai, China.
² Department of Information and Computing Science, College of Mathematics and Econometrics, Hunan University, Hunan, China.
³ Channing Division of Network Medicine, Brigham and Women's Hospital/Harvard Medical School, Boston, USA.
⁴ Channing Division of Network Medicine, Brigham and Women's Hospital/Harvard Medical School, Boston, USA. stwxq@channing.harvard.edu.

Abstract

We propose eight data transformations (r, r2, rv, rv2, l, l2, lv, and lv2) for RNA-seq data analysis. They aim to make the transformed sample mean representative of the distribution center, since it is not always possible to transform count data to satisfy the normality assumption. Simulation studies showed that for data sets with small (e.g., nCases = nControls = 3) or large (e.g., nCases = nControls = 100) sample sizes, limma based on data from the l, l2, and r2 transformations performed better than limma based on data from the voom transformation in terms of accuracy, false discovery rate (FDR), and false negative rate (FNR). For data sets with moderate sample sizes (e.g., nCases = nControls = 30 or 50), limma with the rv and rv2 transformations performed similarly to limma with the voom transformation. Real data analysis results were consistent with the simulation results: limma with the r, l, r2, and l2 transformations performed better than limma with the voom transformation when sample sizes were small or large, and limma with the rv and rv2 transformations performed similarly to limma with the voom transformation when sample sizes were moderate. We also observed from our data analyses that for data sets with large sample sizes, gene selection via the Wilcoxon rank sum test (a non-parametric two-sample test) based on the raw data outperformed limma based on the transformed data.
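As an illustration of the last comparison, the following is a minimal Python sketch of gene selection by a per-gene Wilcoxon rank sum test on raw counts, followed by the three evaluation measures named above. It is not the authors' code. The function names (rank_sum_pvalue, selection_metrics), the toy counts, and the 0.05 cutoff are all hypothetical. The p-value uses the large-sample normal approximation with a tie correction, no multiple-testing adjustment is applied, and FDR and FNR are computed with the common definitions FP/(FP+TP) and FN/(FN+TP), which may differ from the definitions used in the paper.

```python
import math


def rank_sum_pvalue(x, y):
    """Two-sided Wilcoxon rank sum p-value (normal approximation).

    Ties get midranks and the variance is tie-corrected; no continuity
    correction is applied. Assumes both groups are non-empty.
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    # Pool the values, remembering the group (0 = x, 1 = y) of each one.
    pooled = sorted((v, g) for g, vals in enumerate((x, y)) for v in vals)
    rank_sum_x = 0.0  # sum of the ranks of the values in x
    tie_term = 0.0    # sum of t^3 - t over groups of t tied values
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2.0  # average of ranks i+1, ..., j
        t = j - i
        tie_term += t ** 3 - t
        rank_sum_x += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    mean = n1 * (n + 1) / 2.0
    var = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var <= 0:  # all values identical: no evidence of a shift
        return 1.0
    z = (rank_sum_x - mean) / math.sqrt(var)
    return math.erfc(abs(z) / math.sqrt(2))


def selection_metrics(selected, truly_de, n_genes):
    """Return (accuracy, FDR, FNR) for a set of selected genes.

    Assumed definitions: FDR = FP / (FP + TP), FNR = FN / (FN + TP).
    """
    tp = len(selected & truly_de)
    fp = len(selected - truly_de)
    fn = len(truly_de - selected)
    tn = n_genes - tp - fp - fn
    accuracy = (tp + tn) / n_genes
    fdr = fp / (fp + tp) if fp + tp else 0.0
    fnr = fn / (fn + tp) if fn + tp else 0.0
    return accuracy, fdr, fnr


# Invented toy counts: gene -> (case counts, control counts).
toy_counts = {
    "g1": ([50, 60, 55, 70, 65, 80, 75, 90], [10, 12, 15, 11, 14, 13, 16, 18]),
    "g2": ([20, 22, 19, 25, 21, 23, 18, 24], [20, 21, 19, 26, 22, 24, 17, 23]),
}
toy_truly_de = {"g1"}  # invented ground truth for the toy example

# Select genes whose unadjusted p-value falls below a hypothetical cutoff.
selected = {g for g, (cases, controls) in toy_counts.items()
            if rank_sum_pvalue(cases, controls) < 0.05}
print(sorted(selected), selection_metrics(selected, toy_truly_de, len(toy_counts)))
```

With only three samples per group the normal approximation is unreliable, so an exact rank sum test would be the safer choice at that sample size. A real analysis would also adjust the p-values for multiple testing across genes.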

Publication types

Comparative Study
Research Support, N.I.H., Extramural

MeSH terms

Algorithms
Computational Biology / methods*
Data Interpretation, Statistical
Datasets as Topic
Feasibility Studies
Humans
Models, Statistical*
RNA-Seq / methods*
Sample Size
Software
