Joint estimation of isoform expression and isoform-specific read distribution using multisample RNA-Seq data

Bioinformatics. 2014 Feb 15;30(4):506-13. doi: 10.1093/bioinformatics/btt704. Epub 2013 Dec 3.

Abstract

Motivation: RNA-sequencing technologies provide a powerful tool for expression analysis at the gene and isoform levels, but accurate estimation of isoform abundance remains a challenge. The standard assumption of uniform read intensity yields biased estimates when the read intensity is in fact non-uniform. The difficulty is that, without strong assumptions, the read intensity pattern is not identifiable from data observed in a single sample.
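
The bias described above can be illustrated with a small, noise-free toy example. The sketch below is not the paper's model. The two isoforms, the three-exon gene structure, the intensity weights and the use of ordinary least squares are all assumptions chosen for illustration. It compares a fit that assumes uniform read intensity with a fit that uses the (here known) non-uniform pattern.

```python
import numpy as np

# Rows are exons 1-3, columns are two hypothetical isoforms:
# isoform A spans all three exons, isoform B skips exon 2.
structure = np.array([[1.0, 1.0],
                      [1.0, 0.0],
                      [1.0, 1.0]])

# Assumed relative read intensity of each isoform on each exon.
# Each isoform's weights average to one over the exons it contains.
intensity = np.array([[0.6, 1.0],
                      [0.6, 0.0],
                      [1.8, 1.0]])

# Assumed true expression levels of isoforms A and B.
true_expression = np.array([10.0, 5.0])

# Noise-free expected exon counts under the non-uniform model.
counts = (structure * intensity) @ true_expression

# Fit 1: assume uniform intensity, so the design is the 0/1 structure matrix.
uniform_fit, *_ = np.linalg.lstsq(structure, counts, rcond=None)

# Fit 2: use the true intensity pattern in the design matrix.
weighted_fit, *_ = np.linalg.lstsq(structure * intensity, counts, rcond=None)

print("truth:       ", true_expression)
print("uniform fit: ", uniform_fit)
print("weighted fit:", weighted_fit)
```

In this toy setting the uniform fit returns approximately (6, 11) instead of the true (10, 5), while the fit with the correct intensity pattern recovers the truth. In practice the intensity pattern is unknown, which is where the identifiability problem noted above arises.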

Results: We develop a joint statistical model that estimates gene isoform expression while accounting for non-uniform, isoform-specific read distributions. The main challenge is dealing with the large number of isoform-specific read distributions, which are potentially as many as the number of splice variants in the genome. Statistical regularization via a smoothing penalty is imposed to control the estimation. In addition, for identifiability, the method uses information across samples from the same region. We develop a fast and robust computational procedure based on the iterated-weighted least-squares algorithm and apply it to simulated data and two real RNA-Seq datasets with reverse transcription-polymerase chain reaction validation. Empirical tests show that our model performs better than existing methods, giving more precise isoform-level estimates.
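
As a rough illustration of how a smoothing penalty and iteratively reweighted least squares can be combined across samples, the following sketch fits one shared read-intensity profile for a single-isoform gene. It is a minimal sketch under stated assumptions, not the Sequgio implementation: the Poisson likelihood, the second-difference penalty on the log profile, the restriction to one isoform, and the names `fit_smooth_profile` and `lam` are all hypothetical choices made here.

```python
import numpy as np

def fit_smooth_profile(counts, lam=1.0, n_iter=50):
    """Penalized IRLS for a shared read-intensity profile (illustrative only).

    counts: array of shape (samples, positions) for a single-isoform gene.
    Assumed model: counts[s, p] ~ Poisson(expression[s] * profile[p]),
    with a second-difference smoothing penalty on log(profile).
    """
    counts = np.asarray(counts, dtype=float)
    n_pos = counts.shape[1]

    # Second-difference operator: each row has the pattern (1, -2, 1).
    D = np.diff(np.eye(n_pos), n=2, axis=0)
    penalty = lam * (D.T @ D)

    # With one isoform and a mean-one profile, the per-sample
    # expression estimate is simply the mean count of that sample.
    expression = counts.mean(axis=1)

    beta = np.zeros(n_pos)  # log profile, starting from a uniform profile
    for _ in range(n_iter):
        mu = expression[:, None] * np.exp(beta)[None, :]
        # IRLS step for a Poisson log-link model: weights are the fitted
        # means, pooled over samples because the profile is shared.
        weights = mu.sum(axis=0)
        rhs = weights * beta + (counts - mu).sum(axis=0)
        beta = np.linalg.solve(np.diag(weights) + penalty, rhs)

    profile = np.exp(beta)
    return expression, profile / profile.mean()

# Noise-free toy data: three samples sharing one log-linear profile.
true_profile = np.exp(np.linspace(-0.5, 0.5, 20))
true_profile /= true_profile.mean()
true_expression = np.array([20.0, 50.0, 80.0])
toy_counts = np.outer(true_expression, true_profile)

est_expression, est_profile = fit_smooth_profile(toy_counts)
print(np.max(np.abs(est_profile - true_profile)))
```

Pooling the samples is what makes the shared profile estimable here: each sample contributes its own expression level, while the profile is constrained to be common and smooth. A gene with several isoforms would need a joint update of expression and profiles, which this sketch omits.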

Availability and implementation: We have implemented our method in an R package called Sequgio as a pipeline for fast processing of RNA-Seq data.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Alternative Splicing / genetics*
  • Animals
  • Brain / metabolism
  • Computational Biology*
  • Computer Simulation
  • Gene Expression Profiling / methods*
  • High-Throughput Nucleotide Sequencing / methods*
  • Humans
  • Liver / metabolism
  • Mice
  • Models, Statistical*
  • Muscle, Skeletal / metabolism
  • Protein Isoforms
  • RNA, Messenger / genetics
  • Real-Time Polymerase Chain Reaction
  • Reverse Transcriptase Polymerase Chain Reaction
  • Sequence Analysis, RNA / methods*

Substances

  • Protein Isoforms
  • RNA, Messenger