Detecting, Categorizing, and Correcting Coverage Anomalies of RNA-Seq Quantification

Cell Syst. 2019 Dec 18;9(6):589-599.e7. doi: 10.1016/j.cels.2019.10.005. Epub 2019 Nov 27.

Abstract

Because of incomplete reference transcriptomes, incomplete sequencing bias models, or other modeling defects, algorithms to infer isoform expression from RNA sequencing (RNA-seq) sometimes do not accurately model expression. We present a computational method to detect instances where a quantification algorithm could not completely explain the input reads. Our approach identifies regions where the read coverage significantly deviates from expectation. We call these regions "expression anomalies." We further present a method to attribute their cause to either the incompleteness of the reference transcriptome or algorithmic mistakes. We detect anomalies in 30 GEUVADIS and 16 Human Body Map samples. By correcting anomalies when possible, we reduce the number of falsely predicted instances of differential expression. Anomalies that cannot be corrected are suspected to indicate the existence of isoforms unannotated by the reference. We detect 88 common anomalies of this type and find that they tend to have a lower-than-expected coverage toward their 3' ends.
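
The idea of a region whose read coverage deviates from expectation can be illustrated with a minimal sketch. This is not the authors' algorithm: the function name `most_deviant_region`, the per-position coverage lists, and the use of a maximum-subarray scan over the difference of normalized coverage are all assumptions made for illustration, and the sketch performs no significance testing.

```python
from typing import List, Sequence, Tuple


def _normalize(values: Sequence[float]) -> List[float]:
    """Scale coverage so that it sums to 1 (a distribution over positions)."""
    total = float(sum(values))
    if total <= 0:
        raise ValueError("coverage must have a positive total")
    return [v / total for v in values]


def most_deviant_region(
    observed: Sequence[float],
    expected: Sequence[float],
    under: bool = False,
) -> Tuple[int, int, float]:
    """Return (start, end, score) for the contiguous region [start, end)
    whose observed coverage mass most exceeds the expected mass, or falls
    most below it when under=True. Hypothetical helper, for illustration only.
    """
    if len(observed) != len(expected) or len(observed) == 0:
        raise ValueError("coverage vectors must be non-empty and equal length")

    obs = _normalize(observed)
    exp = _normalize(expected)
    sign = -1.0 if under else 1.0
    diff = [sign * (o - e) for o, e in zip(obs, exp)]

    # Kadane's maximum-subarray scan over the per-position differences.
    best_sum, best_start, best_end = diff[0], 0, 1
    cur_sum, cur_start = 0.0, 0
    for i, d in enumerate(diff):
        if cur_sum <= 0:
            cur_sum, cur_start = d, i  # start a new candidate region here
        else:
            cur_sum += d               # extend the current candidate region
        if cur_sum > best_sum:
            best_sum, best_start, best_end = cur_sum, cur_start, i + 1
    return best_start, best_end, best_sum


if __name__ == "__main__":
    # Toy transcript of 8 positions: uniform expectation, but the observed
    # reads vanish over the last four positions (the 3' half).
    observed = [10, 10, 10, 10, 0, 0, 0, 0]
    expected = [1, 1, 1, 1, 1, 1, 1, 1]
    print(most_deviant_region(observed, expected, under=True))
```

In this toy input the under-covered region is the 3' half of the transcript, which mirrors the kind of lower-than-expected 3' coverage described above. A real detector would also need a model of expected coverage that accounts for sequencing bias, and a statistical test to decide whether a deviation is significant.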

Keywords: RNA-seq; anomaly detection; expression quantification; unannotated isoform.

Publication types

  • Research Support, N.I.H., Extramural
  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Base Sequence
  • Databases, Genetic
  • Exome Sequencing / methods
  • Gene Expression Profiling / methods*
  • Humans
  • Protein Isoforms / genetics
  • RNA / genetics
  • RNA-Seq / methods
  • Sequence Analysis, RNA / methods*
  • Transcriptome

Substances

  • Protein Isoforms
  • RNA