Modeling bias and variation in the stochastic processes of small RNA sequencing

Christos Argyropoulos; Alton Etheridge; Nikita Sakhanenko; David Galas

doi:10.1093/nar/gkx199

Nucleic Acids Res. 2017 Jun 20;45(11):e104. doi: 10.1093/nar/gkx199.

Authors

Christos Argyropoulos¹, Alton Etheridge², Nikita Sakhanenko², David Galas²

Affiliations

¹ Department of Internal Medicine, University of New Mexico School of Medicine, Albuquerque, NM 87106, USA.
² Pacific Northwest Research Institute, Seattle, WA 98122, USA.

Abstract

The use of RNA-seq as the preferred method for the discovery and validation of small RNA biomarkers has been hindered by high quantitative variability and biased sequence counts. In this paper, we develop a statistical model for sequence counts that accounts for both ligase bias and stochastic variation. This model implies a linear-quadratic relation between the mean and the variance of sequence counts. Using a large number of sequencing datasets, we demonstrate how the generalized additive models for location, scale and shape (GAMLSS) distributional regression framework can be used to calculate and apply empirical correction factors for ligase bias. Bias correction could remove more than 40% of the bias for miRNAs. Empirical bias correction factors appear to be nearly constant over at least one and up to four orders of magnitude of total RNA input, and to be independent of sample composition. Using synthetic mixes of known composition, we show that the GAMLSS approach can analyze differential expression in small RNA-seq data with greater accuracy and with higher sensitivity and specificity than six existing algorithms (DESeq2, edgeR, EBSeq, limma, DSS, voom).
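As a rough illustration of what a linear-quadratic mean-variance relation looks like in practice, the sketch below assumes the form Var = a·mean + b·mean² and recovers a and b from per-sequence means and variances by ordinary least squares. The function names, the absence of an intercept, and the toy numbers are assumptions made for this example; they are not taken from the paper, whose own analysis uses GAMLSS rather than this simple fit.

```python
def variance_from_mean(mu, a, b):
    """Assumed linear-quadratic relation: Var = a*mu + b*mu**2."""
    return a * mu + b * mu ** 2


def fit_linear_quadratic(means, variances):
    """Least-squares estimate of (a, b) in Var = a*mu + b*mu**2 (no intercept).

    Solves the 2x2 normal equations directly, so no external library is needed.
    """
    s11 = sum(m ** 2 for m in means)
    s12 = sum(m ** 3 for m in means)
    s22 = sum(m ** 4 for m in means)
    t1 = sum(m * v for m, v in zip(means, variances))
    t2 = sum(m ** 2 * v for m, v in zip(means, variances))
    det = s11 * s22 - s12 ** 2
    if det == 0:
        raise ValueError("need at least two distinct non-zero means")
    a = (t1 * s22 - t2 * s12) / det
    b = (s11 * t2 - s12 * t1) / det
    return a, b


# Toy data (hypothetical): a Poisson-like linear term (a = 1) plus
# a quadratic overdispersion term (b = 0.1).
toy_means = [1, 10, 100, 1000]
toy_variances = [variance_from_mean(m, 1.0, 0.1) for m in toy_means]
a_hat, b_hat = fit_linear_quadratic(toy_means, toy_variances)
print(a_hat, b_hat)
```

With a = 1 this form matches the variance function of a negative binomial distribution, where the linear term corresponds to Poisson sampling noise and the quadratic term dominates at high counts.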

MeSH terms

Algorithms
Data Accuracy
Linear Models
Models, Genetic
Poisson Distribution
Sequence Analysis, RNA*
Software
Stochastic Processes

Grants and funding

UL1 TR001449/TR/NCATS NIH HHS/United States