RNA-Seq gene profiling - a systematic empirical comparison

PLoS One. 2014 Sep 30;9(9):e107026. doi: 10.1371/journal.pone.0107026. eCollection 2014.

Abstract

Accurately quantifying gene expression levels is a key goal of experiments that use RNA-sequencing to assay the transcriptome. This typically requires aligning the short reads generated to the genome or transcriptome before quantifying the expression of pre-defined sets of genes. Differences in the alignment and quantification tools can have a major effect on the expression levels found, with important consequences for biological interpretation. Here we address two main issues: do different analysis pipelines affect the gene expression levels inferred from RNA-seq data? And how close are the inferred expression levels to the "true" expression levels? We evaluate fifty gene profiling pipelines on experimental and simulated data sets with different characteristics (e.g., read length and sequencing depth). In the absence of knowledge of the "ground truth" in real RNA-seq data sets, we used simulated data to assess the differences between the "true" expression levels and those reconstructed by the analysis pipelines. Even though this approach does not take into account all known biases present in RNA-seq data, it still allows us to estimate the accuracy of the gene expression values inferred by different analysis pipelines. The results show that i) overall there is a high correlation between the expression levels inferred by the best pipelines and the true quantification values; ii) the error in the estimated gene expression values can vary considerably across genes; and iii) a small set of genes have expression estimates with consistently high error (across data sets and methods). Finally, although the mapping software is important, the quantification method makes a greater difference to the results.
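
The comparison between "true" and inferred expression described above can be illustrated with a minimal sketch. It is not the study's code. It assumes a rank-based (Spearman) correlation as the overall agreement measure and a per-gene relative error as the error measure. Both choices, along with the function names and the toy values, are illustrative assumptions rather than details taken from the abstract.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def relative_errors(true_expr, est_expr):
    """Per-gene |estimate - truth| / truth, skipping genes with zero true expression."""
    return {g: abs(est_expr[g] - t) / t for g, t in true_expr.items() if t > 0}


# Hypothetical toy values, not data from the study.
true_expr = {"geneA": 100.0, "geneB": 50.0, "geneC": 10.0, "geneD": 5.0}
est_expr = {"geneA": 90.0, "geneB": 55.0, "geneC": 30.0, "geneD": 5.0}

genes = sorted(true_expr)
rho = spearman([true_expr[g] for g in genes], [est_expr[g] for g in genes])
errors = relative_errors(true_expr, est_expr)
worst_gene = max(errors, key=errors.get)
```

In this toy example the rank correlation is 1.0 even though the estimate for geneC is off by a factor of three (relative error 2.0). This mirrors findings i) and ii): high overall agreement can coexist with large errors for individual genes.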

Publication types

  • Comparative Study
  • Research Support, Non-U.S. Gov't

MeSH terms

  • Computer Simulation
  • Empirical Research
  • Gene Expression Profiling*
  • High-Throughput Nucleotide Sequencing
  • Humans
  • RNA
  • Research Design
  • Sequence Analysis, RNA
  • Statistics, Nonparametric
  • Transcriptome

Substances

  • RNA

Grants and funding

This work was supported by the European Community's FP7 HEALTH grant CAGEKID (grant agreement 241669). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.