Flexiplex: a versatile demultiplexer and search tool for omics data

Oliver Cheng; Min Hao Ling; Changqing Wang; Shuyi Wu; Matthew E Ritchie; Jonathan Göke; Noorul Amin; Nadia M Davidson

doi:10.1093/bioinformatics/btae102

Bioinformatics. 2024 Mar 4;40(3):btae102. doi: 10.1093/bioinformatics/btae102.

Authors

Oliver Cheng¹,²,³, Min Hao Ling⁴, Changqing Wang⁵,⁶, Shuyi Wu¹,²,³, Matthew E Ritchie⁵,⁶, Jonathan Göke⁴,⁷, Noorul Amin¹,²,⁶, Nadia M Davidson¹,²,⁶

Affiliations

¹ Blood Cells and Blood Cancer Division, The Walter and Eliza Hall Institute of Medical Research, Parkville, VIC 3052, Australia.
² Bioinformatics Division, The Walter and Eliza Hall Institute of Medical Research, Parkville, VIC 3052, Australia.
³ Faculty of Science, The University of Melbourne, Parkville, VIC 3010, Australia.
⁴ Department for Epigenetic and Epitranscriptomic Regulation, Genome Institute of Singapore (GIS), Agency for Science, Technology and Research (A*STAR), Singapore 138672, Republic of Singapore.
⁵ Epigenetics and Development Division, The Walter and Eliza Hall Institute of Medical Research, Parkville, VIC 3052, Australia.
⁶ Department of Medical Biology, Faculty of Medicine, Dentistry and Health Sciences, The University of Melbourne, Parkville, VIC 3010, Australia.
⁷ Department of Statistics and Data Science, National University of Singapore, Singapore 117546, Republic of Singapore.

Abstract

Motivation: Analyzing high-throughput sequencing data often requires identifying and extracting specific target sequences, such as cellular barcodes and unique molecular identifiers (UMIs) in single-cell data, or specific genetic variants for genotyping. However, existing tools that perform these functions are often task-specific, for example demultiplexing barcodes only for a dedicated type of experiment, or are not tolerant of noise in the sequencing data.

Results: To overcome these limitations, we developed Flexiplex, a versatile and fast sequence searching and demultiplexing tool for omics data. Flexiplex is based on the Levenshtein distance and thus allows imperfect matches. We demonstrate its application on three use cases: identifying cell-line-specific sequences in Illumina short-read single-cell data, and both discovering and demultiplexing cellular barcodes in noisy long-read single-cell RNA-seq data. We show that Flexiplex achieves an excellent balance of accuracy and computational efficiency compared to leading task-specific tools.
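
To illustrate why a Levenshtein-based search tolerates sequencing noise, the following is a minimal Python sketch of assigning an observed barcode to the closest entry in a list of known barcodes. It is not Flexiplex's implementation; the function names `levenshtein` and `closest_barcode`, the `max_distance` threshold, and the rule of rejecting ties are all assumptions made for the example.

```python
from typing import Iterable, Optional, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance (substitutions, insertions, deletions) between a and b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def closest_barcode(
    observed: str,
    known_barcodes: Iterable[str],
    max_distance: int,
) -> Optional[Tuple[str, int]]:
    """Return (barcode, distance) for the unique best match, or None.

    Hypothetical policy for this sketch: reject the read if the best
    distance exceeds max_distance or if two barcodes tie for best.
    """
    scored = sorted((levenshtein(observed, bc), bc) for bc in known_barcodes)
    if not scored:
        return None
    best_dist, best_bc = scored[0]
    if best_dist > max_distance:
        return None
    if len(scored) > 1 and scored[1][0] == best_dist:
        return None  # ambiguous assignment
    return best_bc, best_dist
```

With this sketch, a read carrying `ACGTTCGA` would still be assigned to the known barcode `ACGTACGA` when `max_distance` is at least 1, because the two differ by a single substitution, whereas an exact-match search would discard it.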

Availability and implementation: Flexiplex is available at https://davidsongroup.github.io/flexiplex/.

MeSH terms

Electronic Data Processing
High-Throughput Nucleotide Sequencing
Search Engine*
Sequence Analysis, DNA
Software*

Grants and funding

GNT2016547/NHMRC