Error analysis of the PacBio sequencing CCS reads

Int J Biostat. 2023 May 8;19(2):439-453. doi: 10.1515/ijb-2021-0091. eCollection 2023 Nov 1.

Abstract

Third-generation sequencing technologies such as Pacific Biosciences (PacBio) and Oxford Nanopore generate longer reads than next-generation sequencing, which makes the assembly process faster, simpler and more cost-effective. However, these long reads have higher error rates than short reads, so an error-correction step is needed before assembly; in PacBio sequencing machines, one such step is the use of Circular Consensus Sequencing (CCS) reads. In this paper, we propose a probabilistic model for the occurrence of errors along CCS reads. We obtain the error probability of an arbitrary nucleotide, as well as the base-calling Phred quality score of the nucleotides along the CCS reads, in terms of the number of sub-reads. Furthermore, we derive the distribution of the read error rate in relation to the pass number. It follows a binomial distribution, which can be approximated by a normal distribution for long reads. Finally, we evaluate the proposed model by comparing it with three real PacBio datasets: the Lambda and E. coli genomes and an Alzheimer's disease targeted experiment.
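The quantities named in the abstract can be illustrated with a minimal Python sketch. It is not the model derived in the paper. It uses the standard Phred definition Q = -10 log10(p), a toy majority-vote consensus in which each sub-read is assumed to be wrong independently with the same probability and all wrong calls are assumed to agree, and the usual normal approximation to a binomial error count. The function names, the majority-vote assumption and the example numbers are all hypothetical choices made for illustration.

```python
import math


def phred_quality(p_error: float) -> float:
    """Standard Phred score for a base-call error probability p: Q = -10 * log10(p)."""
    if not 0.0 < p_error <= 1.0:
        raise ValueError("p_error must be in (0, 1]")
    return -10.0 * math.log10(p_error)


def majority_vote_error(p_subread: float, n_passes: int) -> float:
    """Toy consensus error for an odd number of passes (assumption, not the paper's model).

    The consensus base is taken to be wrong when more than half of the
    n_passes sub-reads are wrong at that position, with independent errors
    of probability p_subread and all wrong calls assumed to agree.
    """
    if n_passes < 1 or n_passes % 2 == 0:
        raise ValueError("n_passes must be a positive odd integer")
    k_min = n_passes // 2 + 1
    return sum(
        math.comb(n_passes, k) * p_subread**k * (1.0 - p_subread) ** (n_passes - k)
        for k in range(k_min, n_passes + 1)
    )


def error_rate_normal_approx(p_error: float, read_length: int) -> tuple[float, float]:
    """Mean and standard deviation of the per-read error rate.

    If the number of errors in a read of length L is Binomial(L, p), the
    error rate (errors / L) is approximately Normal(p, p * (1 - p) / L)
    when L is large.
    """
    if read_length < 1:
        raise ValueError("read_length must be positive")
    return p_error, math.sqrt(p_error * (1.0 - p_error) / read_length)


if __name__ == "__main__":
    # Hypothetical sub-read error probability, chosen only for illustration.
    p_sub = 0.1
    for n in (1, 3, 5, 7):
        p_ccs = majority_vote_error(p_sub, n)
        print(f"passes={n}  p_ccs={p_ccs:.6f}  Q={phred_quality(p_ccs):.1f}")
    mean, std = error_rate_normal_approx(0.01, 10_000)
    print(f"error rate ~ Normal(mean={mean}, std={std:.6f})")
```

Under these assumptions the consensus error probability falls, and the Phred score rises, as the number of passes grows, and the spread of the per-read error rate shrinks as reads get longer. The exact dependence on the pass number in the paper may differ from this toy rule.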

Keywords: CCS reads accuracy; CCS reads quality; PacBio error model; sequencing noise.

MeSH terms

  • Escherichia coli* / genetics
  • Genome
  • High-Throughput Nucleotide Sequencing* / methods
  • Nucleotides
  • Sequence Analysis, DNA / methods

Substances

  • Nucleotides