SaAlign: Multiple DNA/RNA sequence alignment and phylogenetic tree construction tool for ultra-large datasets and ultra-long sequences based on suffix array

Ziyuan Wang; Junjie Tan; Yanling Long; Yijia Liu; Wenyan Lei; Jing Cai; Yi Yang; Zhibin Liu

doi:10.1016/j.csbj.2022.03.018

Comput Struct Biotechnol J. 2022 Mar 21:20:1487-1493. doi: 10.1016/j.csbj.2022.03.018. eCollection 2022.

Authors

Ziyuan Wang¹, Junjie Tan², Yanling Long³, Yijia Liu¹, Wenyan Lei¹, Jing Cai⁴, Yi Yang¹, Zhibin Liu¹

Affiliations

¹ Key Laboratory of Bio-Resource and Eco-Environment of Ministry of Education, College of Life Sciences, Sichuan University, Chengdu 610064, Sichuan, PR China.
² Center for Clinical Molecular Medicine, National Clinical Research Center for Child Health and Disorders, Ministry of Education Key Laboratory of Child Development and Disorders, China International Science and Technology Cooperation Base of Child Development and Critical Disorders, Chongqing Key Laboratory of Pediatrics, Children's Hospital of Chongqing Medical University, Chongqing 400014, PR China.
³ College of Computer Science, Sichuan University, Chengdu 610064, Sichuan, PR China.
⁴ West China School of Pharmacy, Sichuan University, Chengdu 610041, Sichuan, PR China.

Abstract

Multiple DNA/RNA sequence alignment is a fundamental tool in bioinformatics, especially for phylogenetic tree construction. As DNA sequencing improves, the volume of bioinformatics data keeps growing, and alignment tools must be updated continually to keep pace. Mitochondrial genome analyses of multiple individuals and species depend on such software, so its performance needs to be optimized. To improve the alignment of ultra-large datasets and ultra-long sequences, we optimized a dynamic programming algorithm using longest common substring methods. Tests on ultra-large DNA datasets containing sequences of different lengths, some over 300 kb (kilobases), showed that the Multiple DNA/RNA Sequence Alignment Tool Based on Suffix Array (SaAlign) saved time and memory. It outperformed existing tools, including MAFFT and HAlign-II. For mitochondrial genome datasets with limited numbers of sequences, MAFFT performed the required tasks, but it could not handle ultra-large mitochondrial genome datasets because of core dump errors. We implemented a multiple DNA/RNA sequence alignment tool based on the Center Star strategy and used a suffix array algorithm to improve its space and time efficiency. As whole-genome research and NGS technology become more popular, laboratories need to save computational resources. This software is of great significance in these respects, especially for the study of whole mitochondrial genomes of plants.
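
To illustrate how a suffix array can expose the longest common substring of two sequences, the short Python sketch below may help. It is not SaAlign's implementation: the function names are hypothetical, the suffix array is built by naive sorting (far too slow for ultra-long sequences), and the separator character `#` is assumed not to occur in either input. In a tool of this kind, such shared substrings could serve as anchors so that dynamic programming is needed only on the regions between them.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text in lexicographic order.

    Naive construction by sorting suffix slices; for illustration only.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def common_prefix_length(text, i, j):
    """Length of the common prefix of the suffixes starting at i and j."""
    n = len(text)
    k = 0
    while i + k < n and j + k < n and text[i + k] == text[j + k]:
        k += 1
    return k


def longest_common_substring(a, b, sep="#"):
    """Return one longest substring shared by a and b ('' if none).

    Assumption: sep does not occur in a or b.
    """
    if sep in a or sep in b:
        raise ValueError("separator must not occur in the input sequences")
    text = a + sep + b
    boundary = len(a)  # positions below boundary belong to a
    sa = build_suffix_array(text)

    best_len, best_pos = 0, 0
    # The longest common substring appears as the longest common prefix of
    # two suffixes that are adjacent in the suffix array and come from
    # different sequences.
    for prev, cur in zip(sa, sa[1:]):
        if (prev < boundary) == (cur < boundary):
            continue  # both suffixes start in the same sequence
        k = common_prefix_length(text, prev, cur)
        if k > best_len:
            best_len, best_pos = k, prev
    return text[best_pos:best_pos + best_len]


if __name__ == "__main__":
    print(longest_common_substring("GATTACA", "TTTTACAGG"))
```

Because the separator occurs only once in the concatenated text, a match can never extend across the boundary between the two sequences, so every reported substring lies entirely within each input.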

Keywords: Alignment; DP, Dynamic programming; LCS, Longest common subsequence; MSA, Multiple sequence alignment; Phylogenetic tree; SA, Suffix array; Sequence analysis; Suffix array.