Highly efficient clustering of long-read transcriptomic data with GeLuster

Junchi Ma; Xiaoyu Zhao; Enfeng Qi; Renmin Han; Ting Yu; Guojun Li

doi:10.1093/bioinformatics/btae059

Bioinformatics. 2024 Feb 1;40(2):btae059. doi: 10.1093/bioinformatics/btae059.

Authors

Junchi Ma¹ ², Xiaoyu Zhao², Enfeng Qi³, Renmin Han¹, Ting Yu¹, Guojun Li¹

Affiliations

¹ Research Center for Mathematics and Interdisciplinary Sciences (Frontiers Science Center for Nonlinear Expectations), Shandong University, Qingdao 266237, China.
² School of Mathematics, Shandong University, Jinan, Shandong 250100, China.
³ School of Mathematics and Statistics, Guangxi Normal University, Guilin 541000, China.

Abstract

Motivation: Advances in long-read RNA sequencing technologies promise a bright future for transcriptome analysis, in which clustering long reads by their gene family of origin is of great importance. However, existing de novo clustering algorithms require substantial computing resources.

Results: We developed a new algorithm, GeLuster, for clustering long RNA-seq reads. In our tests on one simulated dataset and nine real datasets, GeLuster exhibited superior performance. On the tested Nanopore datasets, it ran 2.9 to 17.5 times as fast as the second-fastest method with less than one-seventh of the memory consumption, while achieving higher clustering accuracy. GeLuster performed similarly on the PacBio data. It sets the stage for large-scale transcriptome studies in the future.

Availability and implementation: GeLuster is freely available at https://github.com/yutingsdu/GeLuster.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms
Cluster Analysis
Gene Expression Profiling* / methods
High-Throughput Nucleotide Sequencing / methods
RNA-Seq
Sequence Analysis, DNA / methods
Software
Transcriptome*
