Parametric analysis of RNA-seq expression data

Genes Cells. 2016 Jun;21(6):639-47. doi: 10.1111/gtc.12372. Epub 2016 May 20.

Abstract

Various methods have been introduced for the normalization and comparison of RNA-seq count data. However, they lack objectivity because they are based on ad hoc assumptions whose appropriateness has never been verified. Here, we introduce a method that assumes parsimonious models of the data distribution; the assumptions were verified by exploratory data analysis. As expected, count data were lognormally distributed. The level of noise in recent data appeared to be much higher than that of microarray data. Still, an appropriate distribution model would improve the certainty and accuracy of normalization by identifying the reliable range of the data. The primary cause of noise was not the principle of the methodology, namely that each read is a trial determining which transcript is read. Rather, the cause appears to be the overlooking of transcripts, which occurred more often in the lower range of the data. To identify genes that are likely to be overlooked, the number of replicates would be more important than read depth, because greater depth does not prevent overlooking. Both signal and noise in the reliable range of the data were normally distributed, which supports the use of a generalized linear model to evaluate differences in expression levels. Within this framework, normalized data can be compared and combined freely across studies.
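The idea of normalizing lognormally distributed counts using only a reliable range can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `normalize_counts`, the quantile bounds `lo` and `hi`, and the simulated data are all assumptions made for illustration. The sketch fits a straight line to a normal quantile-quantile plot of log10 counts within the chosen range, takes the intercept and slope as the location and scale of the distribution, and converts the log counts to z-scores.

```python
from statistics import NormalDist

import numpy as np


def normalize_counts(counts, lo=0.3, hi=0.95):
    """z-normalize log10 counts of one sample.

    Location (mu) and scale (sigma) are estimated by a line fit on a
    normal Q-Q plot, restricted to plotting positions in [lo, hi].
    The bounds are hypothetical defaults, not values from the paper.
    """
    counts = np.asarray(counts, dtype=float)
    n = counts.size

    # Plotting positions from ranks over all genes, zeros included,
    # so that the undetected low end still occupies the lower quantiles.
    ranks = np.argsort(np.argsort(counts))
    p = (ranks + 0.5) / n
    q = np.array([NormalDist().inv_cdf(x) for x in p])

    # Zero counts have no logarithm and stay NaN.
    logs = np.full(n, np.nan)
    positive = counts > 0
    logs[positive] = np.log10(counts[positive])

    # Fit log10(count) = mu + sigma * q within the assumed reliable range.
    in_range = positive & (p >= lo) & (p <= hi)
    sigma, mu = np.polyfit(q[in_range], logs[in_range], 1)

    return (logs - mu) / sigma, mu, sigma


# Simulated data (an assumption, not data from the study): latent log10
# expression is normal with mean 2 and SD 0.8, and observed counts add
# Poisson sampling noise, which matters most for the low-expressed genes.
rng = np.random.default_rng(0)
latent = rng.normal(loc=2.0, scale=0.8, size=5000)
counts = rng.poisson(10.0 ** latent)

z, mu, sigma = normalize_counts(counts)
print(f"estimated mu={mu:.3f}, sigma={sigma:.3f}")
```

Because each sample is placed on the same z-score scale, normalized values from different samples could in principle be compared directly; how the reliable range should actually be chosen is the subject of the paper and is not reproduced here.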

MeSH terms

  • Animals
  • Databases, Genetic
  • Mice
  • Models, Statistical*
  • National Library of Medicine (U.S.)
  • Sequence Analysis, RNA / methods*
  • United States