SVM classification of human intergenic and gene sequences

Math Biosci. 2005 Jun;195(2):168-78. doi: 10.1016/j.mbs.2005.03.005.

Abstract

Despite constant improvement in prediction accuracy, gene-finding programs are still unable to provide automatic gene discovery with the desired correctness. This paper analyzes gene and intergenic sequences from the point of view of language analysis: gene and intergenic regions are regarded as two different subjects written in the four-letter alphabet {A,C,G,T}, and high-frequency simple sequences are taken as keywords. A measure, alpha(l(tau)), was introduced to describe the relative repeat ratio of simple sequences. Threshold values were found for keyword selection. After eliminating 'noise', 178 short sequences were selected as keywords. DNA sequences were mapped to 178-dimensional Euclidean space, and a support vector machine (SVM) was used to predict gene regions. Cross-validation showed that the program we developed could predict 93% of gene sequences with 7% false positives. When tested on a long genomic multi-gene sequence, our method improved nucleotide-level specificity by 21%, and over 60% of predicted genes corresponded to actual genes.
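
The mapping from a DNA sequence to a keyword-based feature vector can be illustrated with a minimal Python sketch. It is not the authors' program: the four keywords are hypothetical stand-ins for the 178 selected in the paper, the alpha(l(tau)) measure is not reproduced, and dividing each count by the sequence length is an assumption made here for illustration.

```python
def count_occurrences(seq, keyword):
    """Count occurrences of keyword in seq, allowing overlaps."""
    count = 0
    start = seq.find(keyword)
    while start != -1:
        count += 1
        start = seq.find(keyword, start + 1)
    return count


def keyword_vector(seq, keywords):
    """Map a DNA sequence to one feature per keyword.

    Each feature is the keyword's overlapping occurrence count divided
    by the sequence length (an assumed normalization, not the paper's).
    """
    seq = seq.upper()
    length = max(len(seq), 1)  # guard against empty input
    return [count_occurrences(seq, k) / length for k in keywords]


# Hypothetical keywords; the paper selects 178 short sequences.
KEYWORDS = ["ATG", "CG", "TATA", "AAAA"]

vector = keyword_vector("ATGCGCGTATAAAA", KEYWORDS)
print(vector)
```

With the real keyword set, each sequence would become a point in 178-dimensional space, and vectors labeled as gene or intergenic could then be passed to any SVM implementation for training and prediction.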

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Base Sequence*
  • Computational Biology / methods
  • DNA, Intergenic*
  • Genes*
  • Genome, Human
  • Humans
  • Linguistics / methods
  • Models, Genetic*

Substances

  • DNA, Intergenic