Exploration of whole genome amplification generated chimeric sequences in long-read sequencing data

Brief Bioinform. 2023 Sep 20;24(5):bbad275. doi: 10.1093/bib/bbad275.

Abstract

Motivation: Multiple displacement amplification (MDA) has become the most commonly used method of whole genome amplification, generating a vast amount of DNA with higher molecular weight and greater genome coverage. Coupled with long-read sequencing, it makes it possible to sequence amplicons over 20 kb in length. However, the formation of chimeric sequences (chimeras, expressed as structural errors in sequencing data) during MDA seriously interferes with bioinformatics analysis, and its influence on long-read sequencing data is unknown.

Results: We sequenced phi29 DNA polymerase-mediated MDA amplicons on the PacBio platform and analyzed the chimeras within the generated data. The 3rd-ChimeraMiner has been constructed as a pipeline for recognizing chimeras in long-read sequencing data and restoring them to their original structures, making more efficient use of third-generation sequencing (TGS) data. Five long-read datasets and one high-fidelity long-read dataset with various amplification folds were analyzed. The results reveal that mis-priming events during amplification occur more frequently than widely perceived, and that the proportion gradually accumulates from 42% to over 78% as amplification continues. In total, 99.92% of recognized chimeric sequences were demonstrated to be artifacts, whose structures were wrongly formed in MDA rather than existing in the original genomes. By restoring chimeras to their original structures, the vast majority of supplementary alignments that introduce false-positive structural variants are recycled, removing 97% of inversions on average and contributing to the analysis of structural variation in MDA-amplified samples. The impact of chimeras on long-read sequencing data analysis should be emphasized, and the 3rd-ChimeraMiner can help to quantify and reduce their influence.
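
Illustrative sketch (not part of the published abstract, and not the 3rd-ChimeraMiner algorithm): the Results mention supplementary alignments as the source of false-positive structural variants. One simple way to flag candidate chimeric reads is to look for the standard SAM `SA:Z` tag on primary alignments and check whether any supplementary segment maps to the same chromosome on the opposite strand, which is the pattern an inversion-like chimera would produce. The function names, the category labels and the classification rule below are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Segment:
    chrom: str
    pos: int      # 1-based leftmost mapping position
    strand: str   # "+" or "-"


def parse_sa_tag(sa_value: str) -> List[Segment]:
    """Parse an SA:Z value: 'rname,pos,strand,CIGAR,mapQ,NM;' repeated."""
    segments = []
    for entry in sa_value.split(";"):
        if not entry:
            continue  # the tag value ends with a trailing ';'
        rname, pos, strand, _cigar, _mapq, _nm = entry.split(",")
        segments.append(Segment(rname, int(pos), strand))
    return segments


def classify_read(sam_line: str) -> str:
    """Label a primary alignment line (hypothetical categories)."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])
    primary = Segment(fields[2], int(fields[3]), "-" if flag & 0x10 else "+")
    # Optional tags start at column 12 (index 11).
    sa = next((f[5:] for f in fields[11:] if f.startswith("SA:Z:")), None)
    if sa is None:
        return "non_chimeric"
    for seg in parse_sa_tag(sa):
        # Same chromosome, opposite strand: inversion-like split.
        if seg.chrom == primary.chrom and seg.strand != primary.strand:
            return "inverted_candidate"
    return "split_candidate"


def chimera_fraction(sam_lines: Iterable[str]) -> float:
    """Fraction of mapped primary reads that carry a split alignment."""
    total = chimeric = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        flag = int(line.split("\t")[1])
        # Skip unmapped (0x4), secondary (0x100), supplementary (0x800).
        if flag & 0x904:
            continue
        total += 1
        if classify_read(line) != "non_chimeric":
            chimeric += 1
    return chimeric / total if total else 0.0
```

A sketch like this only counts split reads; deciding whether a split is an amplification artifact or a genuine structural variant, and restoring the original structure, requires the additional analysis described in the paper and implemented in the tool itself.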

Availability and implementation: The 3rd-ChimeraMiner is available on GitHub, https://github.com/dulunar/3rdChimeraMiner.

Keywords: chimeric sequence; long-reads sequencing; multiple displacement amplification.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Computational Biology*
  • DNA
  • Genome*
  • High-Throughput Nucleotide Sequencing / methods
  • Sequence Analysis, DNA / methods

Substances

  • DNA