Finding keywords for intergenic and gene regions for human genome

Nucleosides Nucleotides Nucleic Acids. 2005;24(3):191-8. doi: 10.1081/NCN-55714.

Abstract

Analyzing functionally related sequences for conserved patterns is important for further research on different functional regions. This paper analyzes gene and intergenic sequences from the point of view of linguistic analysis: gene and intergenic regions are regarded as two different subjects written in the four-letter alphabet [A, C, G, T], and high-frequency simple sequences are taken as keywords. A measure, alpha[l(tau)], is introduced to describe the relative repeat ratio of simple sequences, and cutoff values are determined for keyword selection. After eliminating "noise," 87 short sequences were selected as keywords for intergenic regions and 76 for gene regions.
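
The keyword-selection idea can be illustrated with a minimal sketch. The abstract does not define alpha[l(tau)], so the ratio below (observed count divided by the count expected if all k-mers were equally likely) is only an assumed stand-in, not the paper's measure. The function names, the toy sequence, and the cutoff value are likewise hypothetical.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count overlapping k-mers made up only of A, C, G, T."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):  # skip windows with N or other symbols
            counts[kmer] += 1
    return counts


def relative_ratio(counts, k):
    """Observed count over the count expected under a uniform k-mer model.

    Assumption: this is a stand-in for the paper's relative repeat ratio,
    whose exact definition is not given in the abstract.
    """
    total = sum(counts.values())
    expected = total / 4 ** k
    return {kmer: c / expected for kmer, c in counts.items()}


def select_keywords(seq, k, cutoff):
    """Return the k-mers whose ratio reaches the (hypothetical) cutoff."""
    ratios = relative_ratio(kmer_counts(seq, k), k)
    return sorted(kmer for kmer, r in ratios.items() if r >= cutoff)


# Toy example: a short repeat-rich sequence, not real genomic data.
keywords = select_keywords("ATATATATGC", k=2, cutoff=5.0)
print(keywords)
```

Running the same selection separately on a gene corpus and an intergenic corpus would yield two keyword lists that can then be compared, which is the general shape of the comparison the abstract describes.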

MeSH terms

  • Base Sequence
  • Computational Biology
  • DNA, Intergenic / genetics*
  • Genome, Human*
  • Genomics / methods*
  • Human Genome Project
  • Humans
  • Linguistics / methods*
  • Models, Genetic*
  • Molecular Sequence Data

Substances

  • DNA, Intergenic