Nucleotide-level distance metrics to quantify alternative splicing implemented in TranD

Nucleic Acids Res. 2024 Mar 21;52(5):e28. doi: 10.1093/nar/gkae056.

Abstract

Advances in affordable transcriptome sequencing combined with better exon and gene prediction have motivated many to compare transcription across the tree of life. We develop a mathematical framework to calculate complexity and compare transcript models. Structural features, i.e. intron retention (IR), donor/acceptor site variation, alternative exon cassettes, and alternative 5'/3' UTRs, are compared and the distance between transcript models is calculated with nucleotide-level precision. All metrics are implemented in a PyPI package, TranD, and output can be used to summarize splicing patterns for a transcriptome (1GTF) and between transcriptomes (2GTF). TranD output enables quantitative comparisons between: annotations augmented by empirical RNA-seq data and the original transcript models; transcript model prediction tools for long-read RNA-seq (e.g. FLAIR versus Isoseq3); alternate annotations for a species (e.g. RefSeq versus Ensembl); and closely related species. In C. elegans, Z. mays, D. melanogaster, D. simulans and H. sapiens, alternative exons were observed more frequently in combination with an alternative donor/acceptor than alone. Transcript models in RefSeq and Ensembl are linked and both have unique transcript models with empirical support. D. melanogaster and D. simulans share many transcript models, and long-read RNA-seq data suggests that both species are under-annotated. We recommend combined references.
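To make the idea of a nucleotide-level distance between two transcript models concrete, the following is a minimal Python sketch. It is not TranD's implementation or API: the function names, the returned fields, and the example coordinates are all hypothetical, and the sketch assumes exons given as 1-based inclusive (start, end) pairs on the same strand of the same gene.

```python
def exon_positions(exons):
    """Return the set of nucleotide positions covered by a list of exons.

    Assumption: exons are 1-based, inclusive (start, end) tuples, as in GTF.
    """
    covered = set()
    for start, end in exons:
        covered.update(range(start, end + 1))
    return covered


def nucleotide_distance(exons_a, exons_b):
    """Count shared and unique exonic nucleotides between two transcript models.

    Illustrative metric only (hypothetical, not TranD's definition): the
    proportion of exonic nucleotides found in exactly one of the two models.
    """
    a = exon_positions(exons_a)
    b = exon_positions(exons_b)
    shared = len(a & b)
    only_a = len(a - b)
    only_b = len(b - a)
    total = shared + only_a + only_b
    return {
        "shared": shared,
        "only_a": only_a,
        "only_b": only_b,
        "prop_different": (only_a + only_b) / total if total else 0.0,
    }


# Hypothetical transcript models: t2 extends the second exon by 20 nt,
# mimicking an alternative donor site on the forward strand.
t1 = [(100, 200), (300, 400), (500, 600)]
t2 = [(100, 200), (300, 420), (500, 600)]

result = nucleotide_distance(t1, t2)
print(result)
```

Using position sets keeps the sketch short and easy to check by hand. For whole transcriptomes an interval-based comparison would be far more memory-efficient, and the structural categories named in the abstract (IR, alternative exons, alternative donor/acceptor sites) would need exon-by-exon logic that this sketch does not attempt.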

MeSH terms

  • Alternative Splicing*
  • Animals
  • Caenorhabditis elegans / genetics
  • Drosophila melanogaster / genetics
  • Gene Expression Profiling
  • Nucleotides
  • RNA Splicing
  • Sequence Analysis, RNA
  • Software
  • Species Specificity
  • Transcriptome* / genetics

Substances

  • Nucleotides