Fast identification of repetitive elements in biological sequences

Y Quentin; G A Fichant

doi:10.1006/jtbi.1994.1004

J Theor Biol. 1994 Jan 7;166(1):51-61. doi: 10.1006/jtbi.1994.1004.

Authors

Y Quentin¹, G A Fichant

Affiliation

¹ Laboratoire de Chimie Bactérienne, Centre National de la Recherche Scientifique, Marseille, France.

PMID: 8145561
DOI: 10.1006/jtbi.1994.1004

Abstract

We have developed a fast filtering method for searching for repetitive sequences in databases that allows different families of repetitive elements to be identified simultaneously in a single scan. It discriminates between repetitive elements and unrelated sequences by comparing the frequencies of k-words found in both groups of sequences. The distance used to sort the sequences is based on a weighting of the k-words, which is obtained by performing a correspondence analysis on learning sets of appropriately chosen sequences. The identification of Alu elements in human sequences is given as an illustration of the method. The Alu sequences are divided into four distinct groups of elements: the left and right monomers located on the direct and on the complementary strands. The results obtained on the test sets show that very good discrimination is achieved with a word length of 6 bp. Indeed, only 0.5% of the non-Alu sequences were incorrectly predicted as Alu elements at a threshold value that allows the identification of all Alu monomers. Misclassification of Alu monomers among the four groups of examples (1.4%) occurs only when the left and the right monomers are in the same orientation. Moreover, during the scanning of 63 GenBank sequences longer than 10 kb, all the Alu elements were correctly identified (616 elements) and only a few non-Alu sequences were wrongly predicted as Alu elements (22 fragments). There is a real need for this kind of method, since most repetitive elements are not annotated in the database entries. The method can therefore be used for systematic screening of new sequences before their insertion in databases. It can also allow the creation of specific databases devoted to repetitive elements, which is a required step for any further analysis of those elements.
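The k-word comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract derives the k-word weights from a correspondence analysis, whereas this sketch substitutes a simpler smoothed log-ratio of k-word frequencies between a learning set of repeats and a learning set of unrelated sequences. All function names, the pseudocount, and the scoring rule are assumptions made for illustration only.

```python
from collections import Counter
from math import log


def kword_counts(seq, k=6):
    """Count overlapping k-words, skipping words with non-ACGT symbols."""
    seq = seq.upper()
    words = (seq[i:i + k] for i in range(len(seq) - k + 1))
    return Counter(w for w in words if set(w) <= set("ACGT"))


def reverse_complement(seq):
    """Return the complementary strand, read 5' to 3'."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def kword_weights(positives, negatives, k=6, pseudo=1.0):
    """Assumed stand-in for the correspondence-analysis weighting:
    a smoothed log-ratio of each k-word's frequency in the repeat
    learning set versus the unrelated learning set."""
    pos, neg = Counter(), Counter()
    for s in positives:
        pos.update(kword_counts(s, k))
    for s in negatives:
        neg.update(kword_counts(s, k))
    vocab = set(pos) | set(neg)
    pos_total = sum(pos.values()) + pseudo * len(vocab)
    neg_total = sum(neg.values()) + pseudo * len(vocab)
    return {
        w: log((pos[w] + pseudo) / pos_total) - log((neg[w] + pseudo) / neg_total)
        for w in vocab
    }


def score(seq, weights, k=6):
    """Mean weight of the k-words in seq; words never seen in either
    learning set contribute zero. Higher means more repeat-like."""
    counts = kword_counts(seq, k)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(weights.get(w, 0.0) * c for w, c in counts.items()) / total
```

In this sketch, a candidate fragment would be called a repeat when `score` exceeds a threshold chosen on test sets. To mirror the four groups in the abstract, one set of weights could be trained per group, with the complementary-strand groups built by passing each learning sequence through `reverse_complement`.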

Publication types

Research Support, U.S. Gov't, Non-P.H.S.
Research Support, U.S. Gov't, P.H.S.

MeSH terms

Algorithms*
Animals
Base Sequence
Databases, Factual*
Mathematics
Molecular Sequence Data
Repetitive Sequences, Nucleic Acid*

Grants and funding

GM-37812/GM/NIGMS NIH HHS/United States