contamDE-lm: linear model-based differential gene expression analysis using next-generation RNA-seq data from contaminated tumor samples

Bioinformatics. 2020 Apr 15;36(8):2492-2499. doi: 10.1093/bioinformatics/btaa006.

Abstract

Motivation: Tumor and adjacent normal RNA samples are commonly used to screen differentially expressed genes between normal and tumor samples or among tumor subtypes. Such paired-sample design could avoid numerous confounders in differential expression (DE) analysis, but the cellular contamination of tumor samples can be an important noise and confounding factor, which can both inflate false-positive rate and deflate true-positive rate. The existing DE tools that use next-generation RNA-seq data either do not account for cellular contamination or are computationally extensive with increasingly large sample size.

Results: A novel linear model was proposed to avoid the problem that could arise from tumor-normal correlation for paired samples. A statistically robust and computationally very fast DE analysis procedure, contamDE-lm, was developed based on the novel model to account for cellular contamination, boosting DE analysis power through the reduction in individual residual variances using gene-wise information. The desired advantages of contamDE-lm over some state-of-the-art methods (limma and DESeq2) were evaluated through the applications to simulated data, TCGA database and Gene Expression Omnibus (GEO) database.

Availability and implementation: The proposed method contamDE-lm was implemented in an updated R package contamDE (version 2.0), which is freely available at https://github.com/zhanghfd/contamDE.

Supplementary information: Supplementary data are available at Bioinformatics online.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Humans
  • Linear Models
  • Neoplasms* / genetics
  • RNA-Seq*
  • Sequence Analysis, RNA
  • Software