Genome sequence assembly algorithms and misassembly identification methods

Yue Meng; Yu Lei; Jianlong Gao; Yuxuan Liu; Enze Ma; Yunhong Ding; Yixin Bian; Hongquan Zu; Yucui Dong; Xiao Zhu

doi:10.1007/s11033-022-07919-8

Mol Biol Rep. 2022 Nov;49(11):11133-11148. doi: 10.1007/s11033-022-07919-8. Epub 2022 Sep 23.

Authors

Yue Meng^#¹, Yu Lei^#², Jianlong Gao³, Yuxuan Liu³, Enze Ma³, Yunhong Ding³, Yixin Bian³, Hongquan Zu⁴, Yucui Dong⁵, Xiao Zhu⁶

Affiliations

¹ School of Information Engineering, Zhengzhou University of Industrial Technology, Zhengzhou, Henan, China.
² Department of Big Data and Intelligent Engineering, Shanxi Institute of Technology, Yangquan, Shanxi, China.
³ School of Computer Science and Information Engineering, Harbin Normal University, Harbin, Heilongjiang, China.
⁴ Center of Network and Information, Harbin Institute of Technology, Harbin, Heilongjiang, China.
⁵ Department of Immunology, Binzhou Medical University, Yantai, Shandong, China. dongyucui521@yeah.net.
⁶ School of Computer and Control Engineering, Yantai University, Yantai, Shandong, China. zhuxiao_hit@yeah.net.

^# Contributed equally.

PMID: 36151399

Abstract

Sequence assembly algorithms have evolved rapidly alongside the growth of genome sequencing technology over the past two decades. Assembly reconstructs the target genome mainly by iteratively extending overlap relationships between sequences. Assembly algorithms are typically classified into several categories, such as the Greedy strategy, the Overlap-Layout-Consensus (OLC) strategy, and the de Bruijn graph (DBG) strategy. In particular, with the rapid development of third-generation sequencing (TGS) technology, several widely used assembly algorithms have been proposed to generate high-quality chromosome-level assemblies. However, because of genome complexity, the limited length of short reads, and the high error rate of long reads, the contigs produced by assembly may contain misassemblies that adversely affect downstream data analysis. Several read-based and reference-based methods for misassembly identification have therefore been developed to improve assembly quality. This work reviews the development of DNA sequencing technologies and summarizes sequencing data simulation methods, sequencing error correction methods, the mainstream sequence assembly algorithms, and misassembly identification methods. The heavy computational cost makes sequence assembly more challenging, so more efficient and accurate assembly algorithms, as well as alternative algorithms, still need to be developed.
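
As an illustration of the Greedy strategy named above, the following is a minimal Python sketch of assembly by iteratively merging the pair of reads with the longest suffix-prefix overlap. It is not taken from the reviewed paper: the function names `overlap` and `greedy_assemble`, the `min_len` threshold, and the toy reads are all assumptions made for this example, and the sketch assumes error-free reads from a single strand, which real assemblers cannot.

```python
from itertools import permutations


def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Return the length of the longest suffix of `a` that equals a prefix
    of `b`, or 0 if that length is below `min_len` (hypothetical helper)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Repeatedly merge the two reads with the largest exact overlap.

    Assumes error-free, same-strand reads; returns the remaining contigs
    once no pair overlaps by at least `min_len` bases.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_k, best_a, best_b = 0, None, None
        for a, b in permutations(contigs, 2):
            k = overlap(a, b, min_len)
            if k > best_k:
                best_k, best_a, best_b = k, a, b
        if best_k == 0:
            break  # no remaining overlaps: stop with several contigs
        contigs.remove(best_a)
        contigs.remove(best_b)
        contigs.append(best_a + best_b[best_k:])  # merge across the overlap
    return contigs


if __name__ == "__main__":
    toy_reads = ["ATGGCGTG", "GCGTGCAA", "TGCAATCC"]
    print(greedy_assemble(toy_reads))
```

Because the greedy choice is purely local, repeats longer than the overlap can cause incorrect merges, which is one source of the misassemblies the abstract mentions; OLC and DBG approaches address this with an explicit graph layout step.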

Keywords: Genome assembly algorithms; Genome sequencing technology; Misassembly identification methods; Third-generation sequencing.

Publication types

Review

MeSH terms

Algorithms*
Base Sequence
Genome*
High-Throughput Nucleotide Sequencing / methods
Sequence Analysis, DNA / methods
Software
