DNA sequence analysis linguistic tools: contrast vocabularies, compositional spectra and linguistic complexity

Alexander Bolshoy

Appl Bioinformatics. 2003;2(2):103-12.

Author

Alexander Bolshoy¹

Affiliation

¹ Genome Diversity Center, Institute of Evolution, University of Haifa, Haifa, Israel. bolshoy@research.haifa.ac.il

PMID: 15130826

Abstract

This is a review of the methods based on counting oligomers in nucleotide and amino acid sequences. Such methods are analogous to the formal linguistic analysis of human texts. This review includes methods based on the calculation of observed occurrences (frequencies) of oligomers and their distribution, as well as those based on deviations between the observed and the expected occurrences (contrast words, genome signatures) in biological sequences. Both types of methods have a wide range of sensitivity and can identify homologous as well as functionally and taxonomically related sequences.
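
As an illustration of the two families of methods named in the abstract, the following is a minimal sketch, not the procedure of any specific work covered by the review. It counts overlapping oligomers of length k and then computes an observed/expected ratio for each one. The expected count here assumes independent nucleotides (a zero-order model), which is only one of several possible background models; the function names `count_oligomers` and `contrast_ratios` are hypothetical.

```python
from collections import Counter
from math import prod


def count_oligomers(seq, k):
    """Count overlapping oligomers (k-mers) of length k in a sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def contrast_ratios(seq, k):
    """Return observed/expected ratios for every k-mer present in seq.

    Assumption: the expected count comes from a zero-order model in which
    each position is drawn independently from the sequence's own
    single-letter frequencies.
    """
    seq = seq.upper()
    n = len(seq)
    letter_freq = {letter: c / n for letter, c in Counter(seq).items()}
    windows = n - k + 1  # number of overlapping k-mer positions
    observed = count_oligomers(seq, k)
    return {
        word: obs / (windows * prod(letter_freq[ch] for ch in word))
        for word, obs in observed.items()
    }


if __name__ == "__main__":
    example = "ACGTACGT"  # toy sequence, not real data
    print(count_oligomers(example, 2))
    print(contrast_ratios(example, 2))
```

Ratios well above or below 1 mark over- or under-represented words under the chosen background model. On a sequence as short as the toy example the ratios are not meaningful; they serve only to show the arithmetic.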

Publication types

Research Support, Non-U.S. Gov't
Review

MeSH terms

Algorithms*
Documentation / methods*
Gene Expression Profiling / methods*
Linguistics
Natural Language Processing
Pattern Recognition, Automated*
Sequence Alignment / methods*
Sequence Analysis, DNA / methods*
Sequence Homology, Nucleic Acid
Vocabulary, Controlled