A simple binomial test for estimating sequencing errors in public repository 16S rRNA sequences

J Microbiol Methods. 2008 Feb;72(2):166-79. doi: 10.1016/j.mimet.2007.11.013. Epub 2007 Nov 23.

Abstract

Sequences in public databases may contain sequencing errors. A double binomial model describing the distribution of indel-excluded similarity coefficients (S) among repeatedly sequenced 16S rRNA was developed previously; it produced a confidence interval for S that is useful for testing sequence identity among sequences 400 bp in length. We characterized patterns in sequencing errors found in nearly complete 16S rRNA sequences of Vibrionaceae as highly variable in reported sequence length and containing a small number of indels. To accommodate these characteristics, a simple binomial model for the distribution of a similarity coefficient (H) that includes indels was derived from the double binomial model for S. The model showed good fit to empirical data. Using either a predetermined or a bootstrap-estimated standard probability of base matching, we were able to apply the exact binomial test to determine the relative level of sequencing error for a given pair of duplicated sequences. A limitation of the method is that duplicated sequences for the same template sequence must be paired, but this can be overcome by using only conserved regions of 16S rRNA sequences and pairing a given sequence with its highest-scoring BLAST search hit from the nr database of GenBank.
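The exact binomial test mentioned above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the two sequences are already aligned to equal length, that every column containing an indel counts as a mismatch when computing H, and that the test is one-sided (the probability of observing this many matches or fewer under a standard match probability p0). The function names and the value p0 = 0.999 are hypothetical and chosen only for illustration; the abstract does not report the standard probability.

```python
from math import exp, lgamma, log


def count_matches(seq_a, seq_b, gap="-"):
    """Return (matches, length) for two pre-aligned sequences.

    Assumption: a column with a gap in either sequence is a mismatch,
    and columns that are gaps in both sequences are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    matches = 0
    length = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap and b == gap:
            continue  # not a real alignment column
        length += 1
        if a == b:
            matches += 1
    return matches, length


def exact_binomial_pvalue(matches, length, p0):
    """One-sided exact binomial p-value: P(X <= matches) for X ~ Bin(length, p0).

    Terms are computed in log space so that alignments of ~1500 columns
    do not overflow floating-point arithmetic.
    """
    if not 0.0 < p0 < 1.0:
        raise ValueError("p0 must lie strictly between 0 and 1")
    total = 0.0
    for i in range(matches + 1):
        log_term = (
            lgamma(length + 1) - lgamma(i + 1) - lgamma(length - i + 1)
            + i * log(p0) + (length - i) * log(1.0 - p0)
        )
        total += exp(log_term)
    return min(total, 1.0)


if __name__ == "__main__":
    # Toy alignment with one indel column; p0 is an illustrative value only.
    m, n = count_matches("ACGT-A", "ACGTTA")
    print("H =", m / n)
    print("p-value =", exact_binomial_pvalue(m, n, p0=0.999))
```

Under these assumptions, a small p-value suggests that the pair shows more disagreement than the standard match probability would predict, that is, a relatively high level of sequencing error in at least one of the two sequences.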

Publication types

  • Evaluation Study
  • Research Support, N.I.H., Extramural
  • Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

  • Bacterial Typing Techniques
  • Base Sequence
  • Confidence Intervals
  • DNA, Bacterial / genetics
  • DNA, Ribosomal / genetics
  • Databases, Nucleic Acid
  • Models, Statistical*
  • Molecular Sequence Data
  • RNA, Ribosomal, 16S / genetics*
  • Sequence Alignment
  • Sequence Analysis, DNA*
  • Sequence Homology, Nucleic Acid
  • Vibrionaceae / classification
  • Vibrionaceae / genetics*

Substances

  • DNA, Bacterial
  • DNA, Ribosomal
  • RNA, Ribosomal, 16S

Associated data

  • GENBANK/EF032498
  • GENBANK/EF032499