Robust identification of differentially expressed genes from RNA-seq data

Genomics. 2020 Mar;112(2):2000-2010. doi: 10.1016/j.ygeno.2019.11.012. Epub 2019 Nov 20.

Abstract

Background: Identification of differentially expressed genes (DEGs) under two or more experimental conditions is an important task for elucidating the molecular basis of phenotypic variation. In the recent years, next generation sequencing (RNA-seq) has become very attractive and competitive alternative to the microarrays because of reducing the cost of sequencing and limitations of microarrays. A number of methods have been developed for detecting the DEGs from RNA-seq data. Most of these methods are based on either Poisson distribution or negative binomial (NB) distribution. However, identification of DEGs based on read count data using skewed distribution is inflexible and complicated of in presence of outliers or extreme values.

Results: Most of the existing DEGs selection methods produce lower accuracies and higher false discoveries in presence of outliers. There are some robust approaches such as edgeR_robust and DEseq2 perform well in presence of outliers for large sample case. But they show weak performance for small-sample case, in presence of outliers. To address this issues an alternative approach has emerged by transforming the RNA-seq data into microarray like data. Among various transformation methods voom using limma pipeline is proven better for RNA-seq data. However, limma by voom transformation is sensitive to outliers for small-sample case. Therefore, in this paper, we robustify the voom approach using the minimum β-divergence method. We demonstrate the performance of the proposed method in a comparison of seven popular biomarkers selection methods: DEseq, DEseq2, SAMseq, Bayseq, limma (voom), edgeR and edgeR_robust using both simulated and real dataset. Both types of experimental results show that the performance of the proposed method improve over the competing methods, in presence of outliers and in absence of outliers it keeps almost equal performance with these methods.

Conclusion: We observe the improved performance of the proposed method from simulation and real RNA-seq count data analysis for both small-and large-sample cases, in presence of outliers. Therefore, our proposal is to use the proposed method instead of existing methods to obtain the better performance for selecting the DEGs.

Keywords: DEGs; Log-cpm; RNA-sequence data; β-Weight function and robustness.

MeSH terms

  • Algorithms*
  • Animals
  • Gene Expression Profiling / methods*
  • Gene Expression Profiling / standards
  • Humans
  • Mice
  • MicroRNAs / genetics
  • RNA-Seq / methods*
  • RNA-Seq / standards
  • Transcriptome

Substances

  • MicroRNAs