SVM classification of human intergenic and gene sequences

Math Biosci. 2005 Jun;195(2):168-78. doi: 10.1016/j.mbs.2005.03.005.

Authors

Y H Qiao¹, J L Liu, C G Zhang, X H Xu, Y J Zeng

Affiliation

¹ Biomechanics and Medical Information Institute, Beijing University of Technology, Beijing 100022, China.

PMID: 15893339
DOI: 10.1016/j.mbs.2005.03.005

Abstract

Despite steady improvements in prediction accuracy, gene-finding programs still cannot provide automatic gene discovery with the desired correctness. This paper analyzes gene and intergenic sequences from the point of view of language analysis: gene and intergenic regions are regarded as two different subjects written in the four-letter alphabet {A,C,G,T}, and high-frequency simple sequences are taken as keywords. A measure, alpha(l(tau)), was introduced to describe the relative repeat ratio of simple sequences, and threshold values were found for keyword selection. After eliminating 'noise', 178 short sequences were selected as keywords. DNA sequences were mapped to 178-dimensional Euclidean space, and a support vector machine (SVM) was used to predict gene regions. Cross-validation showed that the program we developed could predict 93% of gene sequences with 7% false positives. When tested on a long genomic multi-gene sequence, our method improved nucleotide-level specificity by 21%, and over 60% of predicted genes corresponded to actual genes.
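
The mapping from a DNA sequence to a keyword-based feature vector might look roughly like the sketch below. It is an illustration only, not the authors' program: the three keywords are hypothetical placeholders (the abstract does not list the 178 selected keywords), and dividing counts by sequence length is an assumed normalization rather than the paper's alpha(l(tau)) measure, which the abstract does not define.

```python
from typing import List, Sequence

# Hypothetical keywords for illustration; the paper selects 178 keywords,
# which are not given in the abstract.
KEYWORDS = ["AAAA", "CG", "TATA"]


def count_overlapping(seq: str, word: str) -> int:
    """Count occurrences of `word` in `seq`, allowing overlapping matches."""
    count = 0
    start = seq.find(word)
    while start != -1:
        count += 1
        start = seq.find(word, start + 1)
    return count


def keyword_vector(seq: str, keywords: Sequence[str] = KEYWORDS) -> List[float]:
    """Map a DNA sequence to one coordinate per keyword.

    Assumption: each coordinate is the keyword count divided by the
    sequence length, so sequences of different lengths are comparable.
    """
    seq = seq.upper()
    if not seq:
        return [0.0] * len(keywords)
    return [count_overlapping(seq, word) / len(seq) for word in keywords]


if __name__ == "__main__":
    print(keyword_vector("AAAAACGTATA"))
```

With the full keyword list, each sequence would become a point in 178-dimensional space. Such vectors, labelled as gene or intergenic, could then be passed to any standard SVM implementation for training and cross-validation.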

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms
Base Sequence*
Computational Biology / methods
DNA, Intergenic*
Genes*
Genome, Human
Humans
Linguistics / methods
Models, Genetic*

Substances

DNA, Intergenic