cloudrnaSPAdes: isoform assembly using bulk barcoded RNA sequencing data

Dmitry Meleshko; Andrey D Prjibelski; Mikhail Raiko; Alexandru I Tomescu; Hagen Tilgner; Iman Hajirasouliha

doi:10.1093/bioinformatics/btad781

Bioinformatics. 2024 Feb 1;40(2):btad781. doi: 10.1093/bioinformatics/btad781.

Authors

Dmitry Meleshko¹ ², Andrey D Prjibelski³, Mikhail Raiko⁴, Alexandru I Tomescu³, Hagen Tilgner⁵ ⁶, Iman Hajirasouliha² ⁷

Affiliations

¹ Tri-Institutional Computational Biology & Medicine Program, Weill Cornell Medicine of Cornell University, New York, NY 10021, United States.
² Department of Physiology and Biophysics, Institute for Computational Biomedicine, Weill Cornell Medicine, New York, NY 10021, United States.
³ Department of Computer Science, University of Helsinki, Helsinki 00014, Finland.
⁴ Center for Algorithmic Biotechnology, Institute for Translational Biomedicine, St Petersburg State University, St Petersburg 199004, Russia.
⁵ Brain and Mind Research Institute, Weill Cornell Medicine, New York, NY 10021, United States.
⁶ Center for Neurogenetics, Weill Cornell Medicine, New York, NY 10021, United States.
⁷ Englander Institute for Precision Medicine, The Meyer Cancer Center, Weill Cornell Medicine, New York, NY 10021, United States.

Abstract

Motivation: Recent advancements in long-read RNA sequencing have enabled the examination of full-length isoforms, which were previously not captured by short-read sequencing methods. A powerful alternative for studying isoforms is barcoded short-read RNA sequencing, in which a barcode indicates whether two short reads arise from the same molecule. Such techniques include the 10x Genomics linked-read based SParse Isoform Sequencing (SPIso-seq), as well as Loop-Seq and Tell-Seq. Some applications, such as novel-isoform discovery, require very high coverage. Obtaining high coverage using long reads can be difficult, making barcoded RNA-seq data a valuable alternative for this task. However, most annotation pipelines cannot work with a set of short reads in place of a single transcript, nor can they handle coverage gaps within a molecule. To overcome this challenge, we present an RNA-seq assembler that allows the determination of the expressed isoform per barcode.

Results: In this article, we present cloudrnaSPAdes, a tool for assembling full-length isoforms from barcoded RNA-seq linked-read data in a reference-free fashion. Evaluating it on simulated and real human data, we found that cloudrnaSPAdes accurately assembles isoforms, even for genes with high isoform diversity.

Availability and implementation: cloudrnaSPAdes is a feature release of the SPAdes assembler, and the version used for this article is available at https://github.com/1dayac/cloudrnaSPAdes-release.

MeSH terms

Genomics* / methods
High-Throughput Nucleotide Sequencing
Humans
Protein Isoforms / genetics
Protein Isoforms / metabolism
RNA* / genetics
RNA-Seq
Sequence Analysis, RNA / methods
Transcriptome

Substances

RNA
Protein Isoforms

Grants and funding

R35 GM138152/GM/NIGMS NIH HHS/United States