Identification of protein-coding genes in the genome of Vibrio cholerae with more than 98% accuracy using occurrence frequencies of single nucleotides

Eur J Biochem. 2001 Aug;268(15):4261-8. doi: 10.1046/j.1432-1327.2001.02341.x.

Abstract

The published sequence of the Vibrio cholerae genome indicates that, in addition to the genes that encode proteins of known and unknown function, there are 1577 ORFs identified as conserved hypothetical or hypothetical gene candidates. Because the annotation is not 100% accurate, it is not known which of the 1577 ORFs are true protein-coding genes. In this paper, an algorithm based on the Z curve method, with sensitivity, specificity and accuracy greater than 98%, is used to solve this problem. Twenty-fold cross-validation tests show that the accuracy of the algorithm is 98.8%. A detailed discussion of the mechanism of the algorithm is also presented. It was found that 172 of the 1577 ORFs are unlikely to be protein-coding genes. The number of protein-coding genes in the V. cholerae genome was re-estimated and found to be approximately 3716. This result should be of use in microarray analysis of gene expression in the genome, because excluding ORFs that are unlikely to be coding may somewhat decrease the cost of preparing chips. A computer program was written to calculate a coding score called VCZ for gene identification in the genome. An ORF is classified as coding if VCZ > 0 and as noncoding if VCZ < 0. The program is freely available on request for academic use.
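
As an illustration of the kind of calculation the abstract describes, the following Python sketch computes Z curve components from single-nucleotide frequencies at the three codon positions and applies a sign-based decision rule. It is a minimal sketch under stated assumptions, not the authors' program: the abstract does not give the formula for VCZ, so the linear form of the score, the names `z_curve_features`, `coding_score`, `is_coding` and `evaluate`, and any weights or threshold passed in are hypothetical. The evaluation function uses the conventional definitions of sensitivity, specificity and accuracy, which may differ from those used in the paper.

```python
from collections import Counter


def z_curve_features(orf):
    """Return 9 Z curve components: (x, y, z) for each codon position.

    For base frequencies a, c, g, t at one codon position, the standard
    Z curve transform is:
        x = (a + g) - (c + t)   purine vs pyrimidine
        y = (a + c) - (g + t)   amino vs keto
        z = (a + t) - (g + c)   weak vs strong hydrogen bonding
    """
    seq = orf.upper()
    features = []
    for phase in range(3):
        counts = Counter(seq[phase::3])  # bases at this codon position
        n = sum(counts[b] for b in "ACGT")
        if n == 0:
            raise ValueError("no A/C/G/T bases at codon position %d" % (phase + 1))
        a, c, g, t = (counts[b] / n for b in "ACGT")
        features += [(a + g) - (c + t), (a + c) - (g + t), (a + t) - (g + c)]
    return features


def coding_score(features, weights, threshold):
    """Hypothetical linear score; the paper's actual VCZ formula is not in the abstract."""
    return sum(w * u for w, u in zip(weights, features)) - threshold


def is_coding(score):
    """Decision rule from the abstract: coding if the score is greater than 0."""
    return score > 0


def evaluate(predicted, actual):
    """Conventional sensitivity, specificity and accuracy (assumed definitions)."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    tn = sum(1 for p, a in pairs if not p and not a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if not p and a)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one coding and one noncoding example")
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(pairs),
    }
```

In such a setup the weights and threshold would be fitted on ORFs with known labels (for example with a linear discriminant), and cross-validation would repeat the fit on different training subsets before calling `evaluate` on the held-out ORFs.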

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Codon
  • Databases, Factual
  • Genome, Bacterial*
  • Models, Theoretical
  • Oligonucleotide Array Sequence Analysis
  • Open Reading Frames*
  • Reproducibility of Results
  • Sensitivity and Specificity
  • Vibrio cholerae / genetics*

Substances

  • Codon