CoCo: RNA-seq read assignment correction for nested genes and multimapped reads

Gabrielle Deschamps-Francoeur; Vincent Boivin; Sherif Abou Elela; Michelle S Scott

doi:10.1093/bioinformatics/btz433

Bioinformatics. 2019 Dec 1;35(23):5039-5047. doi: 10.1093/bioinformatics/btz433.

Authors

Gabrielle Deschamps-Francoeur¹, Vincent Boivin¹, Sherif Abou Elela², Michelle S Scott¹

Affiliations

¹ Department of Biochemistry and RNA Group, Université de Sherbrooke, Sherbrooke, QC, Canada.
² Department of Microbiology and Infectiology, Faculty of Medicine and Health Sciences, Université de Sherbrooke, Sherbrooke, QC, Canada.

Abstract

Motivation: Next-generation sequencing techniques have revolutionized the study of RNA expression by permitting whole transcriptome analysis. However, sequencing reads generated from nested and multi-copy genes are often either misassigned or discarded, which greatly reduces both quantification accuracy and gene coverage.

Results: Here we present count corrector (CoCo), a read assignment pipeline that takes into account the multitude of overlapping and repetitive genes in the transcriptome of higher eukaryotes. CoCo uses a modified annotation file that highlights nested genes and proportionally distributes multimapped reads between repeated sequences. CoCo salvages over 15% of discarded aligned RNA-seq reads and significantly changes the abundance estimates for both coding and non-coding RNA as validated by PCR and bedgraph comparisons.
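The proportional distribution of multimapped reads can be illustrated with a minimal sketch. It is not the CoCo implementation: it assumes that each group of multimapped reads is split according to the uniquely mapped read counts of the genes it aligns to, and that the split is even when none of those genes has unique reads. The function name `distribute_multimapped`, the gene names and the counts are all hypothetical.

```python
from collections import defaultdict


def distribute_multimapped(unique_counts, multimapped_groups):
    """Split multimapped reads between genes (illustrative sketch only).

    unique_counts: dict mapping gene name -> number of uniquely mapped reads.
    multimapped_groups: iterable of (genes, n_reads) pairs, where `genes` is a
        tuple of gene names that the same n_reads reads align to.

    Assumption: the weight of each gene is its share of the unique reads
    within the group; if the group has no unique reads, reads are split evenly.
    """
    corrected = defaultdict(float)

    # Start from the uniquely mapped counts.
    for gene, n_unique in unique_counts.items():
        corrected[gene] += n_unique

    # Add each group's multimapped reads according to the weights.
    for genes, n_reads in multimapped_groups:
        total_unique = sum(unique_counts.get(g, 0) for g in genes)
        for g in genes:
            if total_unique > 0:
                share = unique_counts.get(g, 0) / total_unique
            else:
                share = 1 / len(genes)
            corrected[g] += n_reads * share

    return dict(corrected)


# Hypothetical example: two copies of a repeated gene share 20 multimapped reads.
unique = {"COPY_A": 30, "COPY_B": 10, "GENE_C": 50}
groups = [(("COPY_A", "COPY_B"), 20)]
corrected = distribute_multimapped(unique, groups)
```

In this example the 20 shared reads are split 3:1 in favour of `COPY_A`, because it has three times as many uniquely mapped reads as `COPY_B`. `GENE_C` is unaffected. Discarding the multimapped reads instead would leave both copies under-counted.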

Availability and implementation: The CoCo software is an open source package written in Python and available from http://gitlabscottgroup.med.usherbrooke.ca/scott-group/coco.

Supplementary information: Supplementary data are available at Bioinformatics online.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

High-Throughput Nucleotide Sequencing
Nested Genes
RNA-Seq
Sequence Analysis, RNA
Software
Transcriptome