CAREx: context-aware read extension of paired-end sequencing data

Felix Kallenborn; Bertil Schmidt

doi:10.1186/s12859-024-05802-w

BMC Bioinformatics. 2024 May 10;25(1):186. doi: 10.1186/s12859-024-05802-w.

Authors

Felix Kallenborn¹, Bertil Schmidt²

Affiliations

¹ Department of Computer Science, Johannes Gutenberg University Mainz, Mainz, Germany. kallenborn@uni-mainz.de.
² Department of Computer Science, Johannes Gutenberg University Mainz, Mainz, Germany.

Abstract

Background: Commonly used next-generation sequencing machines produce large numbers of short reads, each a few hundred base pairs long. However, many downstream applications would benefit from longer reads.

Results: We present CAREx, an algorithm for generating pseudo-long reads from paired-end short-read Illumina data. It repeatedly computes multiple sequence alignments to extend a read until its partner is found. Our performance evaluation on both simulated and real data shows that CAREx connects significantly more read pairs (up to 99% for simulated data) and produces more error-free pseudo-long reads than previous approaches. When used prior to assembly, it can achieve superior de novo assembly results. Furthermore, the GPU-accelerated version of CAREx exhibits the fastest execution times among all tested tools.
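The extend-until-the-partner-is-found idea can be illustrated with a toy sketch. This is not the CAREx implementation: it assumes all pool reads lie on the forward strand, replaces the multiple sequence alignment with an exact-match, gapless pileup and a per-column majority vote, and ignores quality scores. All function names and parameters (`overlap_offset`, `extend_once`, `connect_pair`, `min_overlap`, `step`, `max_len`) are hypothetical and chosen only for this example.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(_COMPLEMENT)[::-1]


def overlap_offset(anchor, read, min_overlap):
    """Leftmost offset where `read` matches the tail of `anchor` exactly
    and reaches past its end; None if there is no such offset."""
    start = max(0, len(anchor) - len(read) + 1)
    for off in range(start, len(anchor) - min_overlap + 1):
        if anchor[off:] == read[:len(anchor) - off]:
            return off
    return None


def extend_once(anchor, reads, min_overlap, step):
    """Pile up reads overlapping the anchor's tail and return up to
    `step` consensus bases (simplified stand-in for an MSA consensus)."""
    columns = [Counter() for _ in range(step)]
    for read in reads:
        off = overlap_offset(anchor, read, min_overlap)
        if off is None:
            continue
        overhang = read[len(anchor) - off:]
        for i, base in enumerate(overhang[:step]):
            columns[i][base] += 1
    extension = []
    for col in columns:
        if not col:  # no read covers this column: stop extending here
            break
        extension.append(col.most_common(1)[0][0])
    return "".join(extension)


def connect_pair(read1, mate, reads, min_overlap=6, step=4, max_len=200):
    """Extend read1 until its mate is found; return the pseudo-long read
    or None. `mate` is given as sequenced (opposite strand)."""
    target = reverse_complement(mate)
    pseudo = read1
    while len(pseudo) <= max_len:
        idx = pseudo.find(target)
        if idx != -1:
            return pseudo[:idx + len(target)]
        ext = extend_once(pseudo, reads, min_overlap, step)
        if not ext:
            return None
        pseudo += ext
    return None


# Usage on a made-up 40 bp fragment with 12 bp reads.
genome = "ACGTTGCAAGGCTTAACCGGATCGTACGATTCAGGCTAGT"
read1 = genome[:12]
mate = reverse_complement(genome[28:])
pool = [genome[i:i + 12] for i in range(0, 28, 3)]
pool.append(genome[28:])
pool.append("GGCTTAACAGGA")  # genome[9:21] with one substitution error
result = connect_pair(read1, mate, pool, min_overlap=6, step=4)
print(result == genome)
```

In this toy pool the erroneous read is outvoted by two correct reads in the affected column, so the substitution does not enter the pseudo-long read. A real tool has to tolerate mismatches inside the overlaps and handle reads from both strands, which this sketch deliberately leaves out.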

Conclusion: CAREx is a new MSA-based algorithm and software for producing pseudo-long reads from paired-end short-read data. It outperforms other state-of-the-art programs in terms of (i) percentage of connected read pairs, (ii) reduction of error rates of filled gaps, (iii) runtime, and (iv) downstream analysis using de novo assembly. CAREx is open-source software written in C++ (CPU version) and in CUDA/C++ (GPU version). It is licensed under GPLv3 and can be downloaded at https://github.com/fkallen/CAREx.

Keywords: GPU; Next-generation sequencing; Pseudo-long reads.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms*
High-Throughput Nucleotide Sequencing* / methods
Humans
Sequence Alignment / methods
Sequence Analysis, DNA / methods
Software*