Statistical comparison of methods to estimate the error probability in short-read Illumina sequencing

J Bioinform Comput Biol. 2010 Jun;8(3):579-91. doi: 10.1142/s021972001000463x.

Abstract

As at the beginning of the sequencing era, the new generation of short-read sequencing technologies still requires both accurate data processing methods and reliable measures of that accuracy. Inspired by the classic of the genre, the Phred method, we generalize its findings in the area of base quality value calibration. We introduce a simple, statistically grounded way to measure the performance of a calibrator and to find an optimal way to assess its reliability. We illustrate the method by assessing the performance of several calibrators/predictors for Illumina Genome Analyser 2 (GA2) data. The choice of the best predictor is based on optimization of validity, discriminative ability and discrimination power across several candidate predictors. We applied the method to data from one experimental run for the genome of the phage φX, and found the best predictor out of ten candidates to be 'Purity', a statistic derived from corrected cluster intensities. The source code for the comparison of the predictors is available from the authors on request.
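
For readers unfamiliar with base quality calibration, the following is a minimal illustrative sketch, not the authors' code (which the abstract says is available only on request). It relies on the standard Phred relation Q = -10 log10(p) between a quality value Q and an error probability p, and it tabulates the observed error rate for each predicted quality value so the two can be compared. The function names and the input layout (one predicted quality value and one error flag per base) are assumptions made for this illustration; the abstract does not describe the paper's own validity and discrimination measures, and they are not reproduced here.

```python
import math
from collections import defaultdict


def phred_to_error_prob(q):
    """Convert a Phred-scaled quality value to an error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability (0 < p <= 1) to a Phred-scaled quality value."""
    return -10 * math.log10(p)


def observed_error_by_quality(predicted_q, is_error):
    """Group bases by predicted quality and tabulate observed errors.

    predicted_q: iterable of predicted quality values, one per base (assumed input).
    is_error:    iterable of booleans, True where the base call disagrees with
                 the reference (assumed input).
    Returns {q: (n_bases, n_errors, observed_error_rate)}.
    """
    counts = defaultdict(lambda: [0, 0])  # q -> [n_bases, n_errors]
    for q, err in zip(predicted_q, is_error):
        counts[q][0] += 1
        counts[q][1] += int(bool(err))
    return {
        q: (n, e, e / n)
        for q, (n, e) in sorted(counts.items())
    }


if __name__ == "__main__":
    # Toy data: 100 bases predicted at Q20 with 1 error, 50 at Q10 with 10 errors.
    qs = [20] * 100 + [10] * 50
    errs = [True] + [False] * 99 + [True] * 10 + [False] * 40
    for q, (n, e, rate) in observed_error_by_quality(qs, errs).items():
        print(q, n, e, rate, phred_to_error_prob(q))
```

In this toy input the Q20 bin has an observed error rate equal to its nominal error probability, while the Q10 bin has twice its nominal rate, which is the kind of discrepancy a calibration assessment is meant to expose.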

Publication types

  • Comparative Study

MeSH terms

  • Algorithms*
  • Artifacts*
  • Base Sequence
  • Chromosome Mapping / methods*
  • Data Interpretation, Statistical*
  • Molecular Sequence Data
  • Sequence Analysis, DNA / methods*
  • Software*