Establishing a method of vector contamination identification in database sequences

G A Seluja; A Farmer; M McLeod; C Harger; P A Schad

doi:10.1093/bioinformatics/15.2.106

Bioinformatics. 1999 Feb;15(2):106-10. doi: 10.1093/bioinformatics/15.2.106.

Authors

G A Seluja¹, A Farmer, M McLeod, C Harger, P A Schad

Affiliation

¹ National Center for Genome Resources, 1800-A Old Pecos Trail, Santa Fe, NM 87505, USA.

PMID: 10089195
DOI: 10.1093/bioinformatics/15.2.106

Abstract

Motivation: The nucleotide sequence databases are invaluable tools for both the private and the academic research communities, with uses ranging from sequence retrieval to homology searching. The databases face several data-quality issues, such as sequencing artifacts and errors. We investigated a major source of these errors, namely the presence of vector-contaminated sequences.

Results: Using a panel of 180 vector polylinker sequences, we found 3029 vector-matching sequences (0.36%) in GenBank Release 95-96, with an average vector-matching length of 72 nucleotides. The number of vector-contaminated sequences has grown with the database; however, the percentage of contaminated sequences has remained approximately constant, averaging 0.28% from 1982 to 1996.
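The kind of screen summarized in the Results can be illustrated with a small sketch. It is not the authors' method: the abstract describes sequence similarity searching against a polylinker panel, while the code below substitutes a naive longest-exact-match test. The toy polylinker (a concatenation of six common restriction sites), the record names, and the 20-nucleotide threshold are all assumptions made for illustration. The final lines back-calculate the approximate database size implied by the reported 3029 sequences and 0.36%; that total is an inference from rounded figures, not a number given in the abstract.

```python
from typing import Dict, Optional, Tuple


def longest_shared_run(seq: str, vector: str) -> int:
    """Length of the longest exact substring shared by seq and vector.

    Plain dynamic programming; a stand-in for a real similarity search.
    """
    best = 0
    prev = [0] * (len(vector) + 1)
    for a in seq:
        curr = [0] * (len(vector) + 1)
        for j, b in enumerate(vector, start=1):
            if a == b:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def screen(records: Dict[str, str], panel: Dict[str, str],
           min_len: int = 20) -> Dict[str, Tuple[str, int]]:
    """Map each flagged record to (best-matching vector name, match length).

    min_len is a hypothetical threshold, not a value from the abstract.
    """
    hits: Dict[str, Tuple[str, int]] = {}
    for rec_id, seq in records.items():
        best_name: Optional[str] = None
        best_len = 0
        for name, vec in panel.items():
            n = longest_shared_run(seq.upper(), vec.upper())
            if n > best_len:
                best_name, best_len = name, n
        if best_name is not None and best_len >= min_len:
            hits[rec_id] = (best_name, best_len)
    return hits


def percent_contaminated(n_hits: int, n_total: int) -> float:
    """Share of flagged records, as a percentage."""
    return 100.0 * n_hits / n_total


# Toy polylinker: EcoRI, BamHI, XbaI, SalI, PstI, HindIII sites in a row.
# Illustrative only; not taken from the authors' 180-sequence panel.
toy_polylinker = "GAATTCGGATCCTCTAGAGTCGACCTGCAGAAGCTT"
panel = {"toy_polylinker": toy_polylinker}

records = {
    "clean": "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG",
    "contaminated": "ACGTACGT" + toy_polylinker[:30] + "TTGACCA",
}

hits = screen(records, panel)
toy_rate = percent_contaminated(len(hits), len(records))

# Database size implied by the reported figures (3029 sequences = 0.36%).
implied_total = round(3029 / 0.0036)
reported_rate = percent_contaminated(3029, implied_total)
```

In this toy data only the record carrying 30 nucleotides of the polylinker is flagged. The back-calculation suggests a database of roughly 840,000 sequences, with some uncertainty because 0.36% is itself a rounded value.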

Availability: The database of vector polylinker sequences can be searched by sequence similarity at http://seqsim.ncgr.org/vector/

Contact: gas@molinfo.com

Publication types

Research Support, Non-U.S. Gov't
Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

Animals
Base Sequence
Cloning, Molecular
DNA / genetics
Databases, Factual*
Genetic Vectors*
Humans
Sequence Analysis, DNA

Substances

DNA