RAfilter: an algorithm for detecting and filtering false-positive alignments in repetitive genomic regions

Hortic Res. 2022 Dec 29;10(1):uhac288. doi: 10.1093/hr/uhac288. eCollection 2023 Jan.

Abstract

Telomere to telomere (T2T) assembly relies on the correctness of sequence alignments. However, the existing aligners tend to generate a high proportion of false-positive alignments in repetitive genomic regions which impedes the generation of T2T-level reference genomes for more important species. In this paper, we present an automatic algorithm called RAfilter for removing the false-positives in the outputs of existing aligners. RAfilter takes advantage of rare k-mers representing the copy-specific features to differentiate false-positive alignments from the correct ones. Considering the huge numbers of rare k-mers in large eukaryotic genomes, a series of high-performance computing techniques such as multi-threading and bit operation are used to improve the time and space efficiencies. The experimental results on tandem repeats and interspersed repeats show that RAfilter was able to filter 60%-90% false-positive HiFi alignments with almost no correct ones removed, while the sensitivities and precisions on ONT datasets were about 80% and 50% respectively.