polishCLR: A Nextflow Workflow for Polishing PacBio CLR Genome Assemblies

Jennifer Chang; Amanda R Stahlke; Sivanandan Chudalayandi; Benjamin D Rosen; Anna K Childers; Andrew J Severin

doi:10.1093/gbe/evad020

Genome Biol Evol. 2023 Mar 3;15(3):evad020. doi: 10.1093/gbe/evad020.

Authors

Jennifer Chang¹ ² ³, Amanda R Stahlke⁴, Sivanandan Chudalayandi³, Benjamin D Rosen⁵, Anna K Childers⁴, Andrew J Severin³

Affiliations

¹ USDA, Agricultural Research Service, Jamie Whitten Delta States Research Center, Genomics and Bioinformatics Research Unit, Stoneville, Mississippi.
² Oak Ridge Institute for Science and Education, Oak Ridge, Tennessee.
³ Genome Informatics Facility, Office of Biotechnology, Iowa State University, Ames.
⁴ USDA, Agricultural Research Service, Beltsville Agricultural Research Center, Bee Research Laboratory, Beltsville, Maryland.
⁵ USDA, Agricultural Research Service, Beltsville Agricultural Research Center, Animal Genomics and Improvement Laboratory, Beltsville, Maryland.

Abstract

Long-read sequencing has revolutionized genome assembly, yielding highly contiguous, chromosome-level contigs. However, assemblies from some third-generation long-read technologies, such as Pacific Biosciences (PacBio) continuous long reads (CLR), have a high error rate. Such errors can be corrected with short reads through a process called polishing. Although best practices for polishing non-model de novo genome assemblies were recently described by the Vertebrate Genomes Project (VGP) Assembly community, there is a need for a publicly available, reproducible workflow that can be easily implemented and run on a conventional high-performance computing environment. Here, we describe polishCLR (https://github.com/isugifNF/polishCLR), a reproducible Nextflow workflow that implements best practices for polishing assemblies made from CLR data. PolishCLR can be initiated from several input options that extend best practices to suboptimal cases. It also provides re-entry points at several key stages: identifying duplicate haplotypes with purge_dups, pausing for scaffolding if data are available, and running multiple rounds of polishing and evaluation with Arrow and FreeBayes. PolishCLR is containerized and publicly available for the greater assembly community as a tool to complete assemblies from existing, error-prone long-read data.
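
The abstract refers to evaluating assemblies between polishing rounds, and QV appears among the keywords. As a minimal sketch of what that metric expresses, the snippet below computes a Phred-scaled consensus quality value from an error count, using the standard definition QV = -10 * log10(error rate). The function name `phred_qv` and the example counts are hypothetical and chosen only for illustration; this is not the evaluation code used by polishCLR, which is not described in this record.

```python
import math


def phred_qv(errors: int, total_bases: int) -> float:
    """Return the Phred-scaled quality value QV = -10 * log10(errors / total_bases).

    `errors` is the number of consensus bases judged incorrect and
    `total_bases` is the number of bases evaluated (both illustrative inputs).
    """
    if total_bases <= 0:
        raise ValueError("total_bases must be positive")
    if errors < 0 or errors > total_bases:
        raise ValueError("errors must be between 0 and total_bases")
    if errors == 0:
        # No observed errors: the quality value is unbounded.
        return math.inf
    return -10.0 * math.log10(errors / total_bases)


if __name__ == "__main__":
    # Hypothetical counts before and after polishing, for illustration only.
    for label, errors in [("before polishing", 10_000), ("after polishing", 100)]:
        print(f"{label}: QV = {phred_qv(errors, 10_000_000):.1f}")
```

Because the scale is logarithmic, each tenfold reduction in the error rate raises QV by 10, which is why QV is a convenient way to compare an assembly across successive polishing rounds.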

Keywords: Nextflow; QV; assembly; genome; polish; polishCLR.

Publication types

Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

Haplotypes
High-Throughput Nucleotide Sequencing*
Sequence Analysis, DNA
Workflow