Minirmd: accurate and fast duplicate removal tool for short reads via multiple minimizers

Yuansheng Liu; Xiaocai Zhang; Quan Zou; Xiangxiang Zeng

doi:10.1093/bioinformatics/btaa915

Bioinformatics. 2021 Jul 12;37(11):1604-1606. doi: 10.1093/bioinformatics/btaa915.

Authors

Yuansheng Liu¹, Xiaocai Zhang², Quan Zou³, Xiangxiang Zeng¹

Affiliations

¹ College of Information Science and Engineering, Hunan University, Changsha, Hunan 410012, China.
² Advanced Analytics Institute, University of Technology Sydney, Broadway, NSW 2007, Australia.
³ Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China.

PMID: 33112385

Abstract

Summary: Removing duplicate and near-duplicate reads generated by high-throughput sequencing technologies can reduce the computational resources required by downstream applications. Here we develop minirmd, a de novo tool that removes duplicate reads via multiple rounds of clustering using minimizers of different lengths. Experiments demonstrate that minirmd removes more near-duplicate reads than existing clustering approaches and is faster than existing multi-core tools. To the best of our knowledge, minirmd is the first tool to remove near-duplicates on the reverse-complementary strand.
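
The general idea of clustering reads by minimizers over several rounds can be illustrated with a minimal Python sketch. This is not the minirmd implementation: the function names, the lexicographic minimizer ordering, the Hamming-distance criterion, the mismatch threshold and the values of k are all assumptions chosen for illustration. Each round buckets reads by the smallest k-mer found in the read or its reverse complement, then compares a read only against reads already kept in the same bucket.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical_minimizer(read, k):
    """Lexicographically smallest k-mer over the read and its reverse complement.

    A read and its reverse complement therefore always share the same key.
    """
    rc = reverse_complement(read)
    if len(read) < k:  # read too short for a k-mer: fall back to the whole read
        return min(read, rc)
    return min(s[i:i + k] for s in (read, rc) for i in range(len(s) - k + 1))


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def is_near_duplicate(a, b, max_mismatches):
    """True if a matches b, or the reverse complement of b, within the threshold."""
    if len(a) != len(b):
        return False
    return (hamming(a, b) <= max_mismatches
            or hamming(a, reverse_complement(b)) <= max_mismatches)


def remove_near_duplicates(reads, ks=(8, 6, 4), max_mismatches=1):
    """Run one clustering round per minimizer length in ks (hypothetical values)."""
    kept = list(reads)
    for k in ks:
        buckets = defaultdict(list)  # minimizer -> reads kept in this round
        survivors = []
        for read in kept:
            key = canonical_minimizer(read, k)
            if any(is_near_duplicate(read, other, max_mismatches)
                   for other in buckets[key]):
                continue  # near-duplicate of a read already kept in this bucket
            buckets[key].append(read)
            survivors.append(read)
        kept = survivors
    return kept
```

In this sketch, a mismatch that falls inside the minimizer sends two near-identical reads to different buckets, so a single round can miss the pair. For example, "AACCGGTTAC" and "AACCGGATAC" differ at one position and have different minimizers for k = 8 but the same minimizer for k = 6, so only a run that includes the second round removes one of them.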

Availability and implementation: https://github.com/yuansliu/minirmd.

Supplementary information: Supplementary data are available at Bioinformatics online.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms*
Cluster Analysis
High-Throughput Nucleotide Sequencing
Sequence Analysis, DNA
Software*

Grants and funding

61872309/National Natural Science Foundation of China