Short Read Alignment Based on Maximal Approximate Match Seeds

Wei Quan; Dengfeng Guan; Guangri Quan; Bo Liu; Yadong Wang

doi:10.3389/fmolb.2020.572934

Front Mol Biosci. 2020 Nov 5:7:572934. doi: 10.3389/fmolb.2020.572934. eCollection 2020.

Authors

Wei Quan¹, Dengfeng Guan¹ ², Guangri Quan¹, Bo Liu¹, Yadong Wang¹

Affiliations

¹ School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China.
² Institute of Zoology, Chinese Academy of Sciences, Beijing, China.

Abstract

Sequence alignment is a critical step in many genomic studies, such as variant calling, quantitative transcriptome analysis (RNA-seq), and metagenomic sequence classification. However, alignment performance is strongly affected by repetitive sequences in the reference genome, which are widespread in species from bacteria to mammals. Aligning reads from repetitive regions can produce a very large number of candidate locations, which imposes a heavy computational burden. Most alignment tools therefore simply discard highly repetitive seeds, but this may cause the true alignment to be missed. Using maximal exact matches (MEMs) as seeds is an option, but MEM seeds may fail when sequencing errors or genomic variations fall within them. Here, we propose a novel sequence alignment algorithm, named MAM, which can efficiently align short DNA sequences. MAM first builds a modified Burrows-Wheeler transform (BWT) structure of the reference genome to accelerate approximate seed matching. Then, MAM uses maximal approximate match (MAM) seeds to reduce the number of candidate locations. Finally, MAM applies affine-gap-penalty dynamic programming to extend the MAM seeds. Experimental results on simulated and real sequencing datasets show that MAM is faster than other state-of-the-art alignment tools. The source code is available at https://github.com/weiquan/mam.
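
The extension step named in the abstract, dynamic programming with an affine gap penalty, can be illustrated with a minimal sketch. This is the textbook three-matrix (Gotoh) recurrence for a global alignment score, not the MAM implementation. The function name `affine_gap_score` and all scoring parameters are assumptions chosen for illustration; they are not the tool's defaults.

```python
def affine_gap_score(a, b, match=1, mismatch=-4, gap_open=-6, gap_extend=-1):
    """Global alignment score under an affine gap penalty (Gotoh recurrence).

    A gap of length k costs gap_open + k * gap_extend.
    All scoring values are illustrative assumptions.
    """
    neg_inf = float("-inf")
    n, m = len(a), len(b)

    # M: alignment ends with a[i-1] paired to b[j-1]
    # X: alignment ends with a[i-1] paired to a gap
    # Y: alignment ends with b[j-1] paired to a gap
    M = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    X = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    Y = [[neg_inf] * (m + 1) for _ in range(n + 1)]

    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = gap_open + i * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + j * gap_extend

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + s
            # Opening a new gap pays gap_open + gap_extend;
            # continuing an existing gap pays only gap_extend.
            X[i][j] = max(
                M[i - 1][j] + gap_open + gap_extend,
                Y[i - 1][j] + gap_open + gap_extend,
                X[i - 1][j] + gap_extend,
            )
            Y[i][j] = max(
                M[i][j - 1] + gap_open + gap_extend,
                X[i][j - 1] + gap_open + gap_extend,
                Y[i][j - 1] + gap_extend,
            )

    return max(M[n][m], X[n][m], Y[n][m])


if __name__ == "__main__":
    # One deletion of length 2 is charged a single gap-open penalty.
    print(affine_gap_score("AAGGTT", "AATT"))
```

Because a gap's opening cost is paid once, a single two-base gap scores better than two separate one-base gaps, which is the property that makes affine penalties a common choice for modeling indels in read alignment.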

Keywords: FM-index; next-generation sequencing; repeats; sequence alignment; whole-genome resequencing.