Highly efficient clustering of long-read transcriptomic data with GeLuster

Bioinformatics. 2024 Feb 1;40(2):btae059. doi: 10.1093/bioinformatics/btae059.

Abstract

Motivation: Advances in long-read RNA sequencing technologies promise a bright future for transcriptome analysis, in which clustering long reads according to their gene family of origin is of great importance. However, existing de novo clustering algorithms require substantial computing resources.

Results: We developed GeLuster, a new algorithm for clustering long RNA-seq reads. In our tests on one simulated dataset and nine real datasets, GeLuster exhibited superior performance. On the tested Nanopore datasets, it ran 2.9-17.5 times as fast as the second-fastest method and used less than one-seventh of the memory, while achieving higher clustering accuracy. GeLuster performed similarly on the PacBio data. It sets the stage for large-scale transcriptome studies in the future.

Availability and implementation: GeLuster is freely available at https://github.com/yutingsdu/GeLuster.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Cluster Analysis
  • Gene Expression Profiling* / methods
  • High-Throughput Nucleotide Sequencing / methods
  • RNA-Seq
  • Sequence Analysis, DNA / methods
  • Software
  • Transcriptome*