Mixture models reveal multiple positional bias types in RNA-Seq data and lead to accurate transcript concentration estimates

PLoS Comput Biol. 2017 May 15;13(5):e1005515. doi: 10.1371/journal.pcbi.1005515. eCollection 2017 May.

Abstract

Accuracy of transcript quantification with RNA-Seq is negatively affected by positional fragment bias. This article introduces Mix2 (rd. "mixquare"), a transcript quantification method which uses a mixture of probability distributions to model and thereby neutralize the effects of positional fragment bias. The parameters of Mix2 are trained by Expectation Maximization resulting in simultaneous transcript abundance and bias estimates. We compare Mix2 to Cufflinks, RSEM, eXpress and PennSeq; state-of-the-art quantification methods implementing some form of bias correction. On four synthetic biases we show that the accuracy of Mix2 overall exceeds the accuracy of the other methods and that its bias estimates converge to the correct solution. We further evaluate Mix2 on real RNA-Seq data from the Microarray and Sequencing Quality Control (MAQC, SEQC) Consortia. On MAQC data, Mix2 achieves improved correlation to qPCR measurements with a relative increase in R2 between 4% and 50%. Mix2 also yields repeatable concentration estimates across technical replicates with a relative increase in R2 between 8% and 47% and reduced standard deviation across the full concentration range. We further observe more accurate detection of differential expression with a relative increase in true positives between 74% and 378% for 5% false positives. In addition, Mix2 reveals 5 dominant biases in MAQC data deviating from the common assumption of a uniform fragment distribution. On SEQC data, Mix2 yields higher consistency between measured and predicted concentration ratios. A relative error of 20% or less is obtained for 51% of transcripts by Mix2, 40% of transcripts by Cufflinks and RSEM and 30% by eXpress. Titration order consistency is correct for 47% of transcripts for Mix2, 41% for Cufflinks and RSEM and 34% for eXpress. We, further, observe improved repeatability across laboratory sites with a relative increase in R2 between 8% and 44% and reduced standard deviation.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Computational Biology
  • Gene Expression Profiling / methods*
  • Humans
  • Models, Genetic*
  • Models, Statistical
  • RNA / analysis
  • RNA / genetics
  • RNA / metabolism
  • Sequence Analysis, RNA / methods*

Substances

  • RNA

Grants and funding

This work was supported by: Die Österreichische Forschungsförderungsgesellschaft FFG, https://www.ffg.at, grant nr. 838191: to AT; Wiener ArbeitnehmerInnen Förderungsfond, https://www.waff.at: to GW SG; and Lexogen GmbH, https://www.lexogen.com: to AT GW SG. The funders had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.