Exclusive sequences of different genomes

Sergey I Mitrofanov; Alexander Y Panchin; Sergei A Spirin; Andrei V Alexeevski; Yuri V Panchin

doi:10.1142/S0219720010004719

J Bioinform Comput Biol. 2010 Jun;8(3):519-34. doi: 10.1142/S0219720010004719.

Authors

Sergey I Mitrofanov¹, Alexander Y Panchin, Sergei A Spirin, Andrei V Alexeevski, Yuri V Panchin

Affiliation

¹ Faculty of Bioengineering and Bioinformatics, Moscow State University, Moscow, Russia. mitroser04@mail.ru

PMID: 20556860

Abstract

We studied the distribution of 1-7 bp words in a dataset that includes 139 complete eukaryotic genomes, 33 masked eukaryotic genomes and coding regions from 35 genomes. We tested different statistical models for identifying over- and under-represented words. The method described by Karlin et al. has the strongest predictive power among the methods tested. Using this method, we identified over- and under-represented words that are consistent across a large array of taxonomic groups. Some of those words have not previously been described as exclusive. For example, CGCG is over-represented in CG-deficient organisms. We also describe exceptions for widely known exclusive words, such as CG and TA.
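
To make the idea of over- and under-represented words concrete, the following is a minimal sketch of a dinucleotide odds ratio in the style usually attributed to Karlin and co-workers: the frequency of a 2 bp word divided by the product of the frequencies of its two bases, with counts pooled over both strands. The abstract does not give the authors' exact formulas or cutoffs, so the function names, the strand pooling, and the thresholds 0.78 and 1.23 (values commonly quoted for this statistic in the literature) are assumptions for illustration only. Longer words (3-7 bp) need higher-order corrections that are not shown here.

```python
from collections import Counter

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement; non-ACGT symbols are left unchanged."""
    return seq.translate(COMPLEMENT)[::-1]


def dinucleotide_odds_ratios(seq):
    """Odds ratio f(XY) / (f(X) * f(Y)) for each dinucleotide XY.

    Counts are pooled over the sequence and its reverse complement
    (an assumption of this sketch). Positions with non-ACGT symbols
    are skipped.
    """
    seq = seq.upper()
    mono = Counter()
    di = Counter()
    for strand in (seq, reverse_complement(seq)):
        for base in strand:
            if base in BASES:
                mono[base] += 1
        for i in range(len(strand) - 1):
            x, y = strand[i], strand[i + 1]
            if x in BASES and y in BASES:
                di[x + y] += 1

    n_mono = sum(mono.values())
    n_di = sum(di.values())
    ratios = {}
    if n_mono == 0 or n_di == 0:
        return ratios
    for x in BASES:
        for y in BASES:
            expected = (mono[x] / n_mono) * (mono[y] / n_mono)
            if expected > 0:
                ratios[x + y] = (di[x + y] / n_di) / expected
    return ratios


def classify(ratio, low=0.78, high=1.23):
    """Label a ratio using illustrative cutoffs (assumed, not from the abstract)."""
    if ratio <= low:
        return "under-represented"
    if ratio >= high:
        return "over-represented"
    return "normal"


if __name__ == "__main__":
    # Toy sequence for demonstration only; real analyses use whole genomes.
    example = "AACCTTGG"
    for word, ratio in sorted(dinucleotide_odds_ratios(example).items()):
        print(word, round(ratio, 3), classify(ratio))
```

On a real genome, a ratio well below 1 for CG or TA would correspond to the familiar under-representation of those words mentioned in the abstract; the toy sequence above is far too short for its values to be meaningful.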

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms*
Animals
Base Sequence
Chromosome Mapping / methods*
Genome / genetics*
Humans
Molecular Sequence Data
Sequence Alignment / methods*
Sequence Analysis, DNA / methods*