Identifying and removing artificial replicates from 454 pyrosequencing data

Cold Spring Harb Protoc. 2010 Apr;2010(4):pdb.prot5409. doi: 10.1101/pdb.prot5409.

Abstract

An intrinsic artifact of 454-based pyrosequencing leads to artificial overrepresentation of >10% of the original DNA sequencing templates. This artificial amplification of sequences is unbiased with regard to position on the pyrosequencing plate or sequence identity, and it occurs in all currently available 454 technologies. The amplified sequences start at the same position and are either identical (duplicates), vary in length, or contain a sequencing discrepancy. If the abundance of any sequence in a data set is to be enumerated, whether for comparative community analysis, transcriptional analysis, or other applications, it is important to remove these artificial replicates before analysis. A web-based tool that incorporates the clustering algorithm cd-hit was developed to identify and remove artificially replicated sequences in 454-based pyrosequencing data sets. This tool cannot be used for data sets that include an initial amplification step before the standard pyrosequencing procedure (e.g., sequencing of amplified gene "tags"), because artificial replicates cannot be distinguished from the replication expected from polymerase chain reaction (PCR) amplification. This protocol provides details on how to use the replicate filter and obtain a file of unique sequences for use in metagenomic or transcriptomic analyses.
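
The replicate criteria described above (same start position, then identical, shorter, or differing by a sequencing discrepancy) can be illustrated with a minimal Python sketch. This is not the cd-hit algorithm and not the web tool's implementation; it is a simplified greedy stand-in that compares reads position by position from the first base and ignores insertions and deletions. The function names, the 0.9 identity cutoff, and the 3-base start check are assumptions chosen for illustration only.

```python
from typing import Dict, List, Tuple

Read = Tuple[str, str]  # (read identifier, nucleotide sequence)


def prefix_identity(a: str, b: str) -> float:
    """Fraction of matching positions over the shorter read, aligned at the first base."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    # zip stops at the shorter read, so only the shared prefix is compared.
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / n


def filter_replicates(
    reads: List[Read],
    min_identity: float = 0.9,  # assumed cutoff, for illustration only
    start_bases: int = 3,       # assumed number of leading bases that must match
) -> Tuple[List[Read], Dict[str, List[str]]]:
    """Greedy clustering: keep the longest read of each putative replicate group."""
    # Longest reads first, so each cluster is represented by its longest member.
    ordered = sorted(reads, key=lambda r: len(r[1]), reverse=True)
    representatives: List[Read] = []
    clusters: Dict[str, List[str]] = {}
    for read_id, seq in ordered:
        for rep_id, rep_seq in representatives:
            same_start = seq[:start_bases] == rep_seq[:start_bases]
            if same_start and prefix_identity(seq, rep_seq) >= min_identity:
                clusters[rep_id].append(read_id)  # treated as an artificial replicate
                break
        else:
            # No representative matched: this read founds a new cluster.
            representatives.append((read_id, seq))
            clusters[read_id] = [read_id]
    return representatives, clusters


if __name__ == "__main__":
    example = [
        ("r1", "ACGTACGTAC"),
        ("r2", "ACGTACGTAC"),  # exact duplicate of r1
        ("r3", "ACGTAC"),      # same start, shorter
        ("r4", "ACGTACGTAT"),  # same start, one discrepancy
        ("r5", "TTGACCAGTA"),  # different start, kept as unique
    ]
    unique, groups = filter_replicates(example)
    for read_id, seq in unique:
        print(read_id, seq, "replicates:", groups[read_id][1:])
```

The all-against-representatives comparison is quadratic in the worst case, which is acceptable for a demonstration but not for full pyrosequencing runs; a dedicated clustering tool is the practical choice at that scale. As the abstract notes, this kind of filtering is not appropriate for PCR-amplified tag data, where true and artificial replicates cannot be told apart.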

MeSH terms

  • Cluster Analysis
  • Computational Biology / methods*
  • Diagnostic Errors
  • Internet
  • Sequence Analysis, DNA / methods*
  • Software*
  • Statistics as Topic / methods*