The context-tree kernel for strings

Marco Cuturi; Jean-Philippe Vert

doi:10.1016/j.neunet.2005.07.010

Neural Netw. 2005 Oct;18(8):1111-23. doi: 10.1016/j.neunet.2005.07.010. Epub 2005 Sep 27.

Authors

Marco Cuturi¹, Jean-Philippe Vert

Affiliation

¹ Computational Biology Group, Ecole des Mines de Paris, 35 rue Saint Honoré, 77300 Fontainebleau, France. marco.cuturi@ensmp.fr

PMID: 16198086

Abstract

We propose a new kernel for strings which borrows ideas and techniques from information theory and data compression. This kernel can be used in combination with any kernel method, in particular Support Vector Machines for string classification, with notable applications in proteomics. By using a Bayesian averaging framework with conjugate priors on a class of Markovian models known as probabilistic suffix trees or context-trees, we compute the value of this kernel in linear time and space while only using the information contained in the spectrum of the considered strings. This is ensured through an adaptation of a compression method known as the context-tree weighting algorithm. Encouraging classification results are reported on a standard protein homology detection experiment, showing that the context-tree kernel performs well with respect to other state-of-the-art methods while using no biological prior knowledge.
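
The abstract notes that the kernel uses only the information contained in the spectrum of the strings. As a minimal sketch of what that input looks like, assuming "spectrum" is read in the usual sense of contiguous-substring (k-mer) counts, the Python below counts k-mers and takes the inner product of two count vectors. That inner product is the plain spectrum kernel, a simpler baseline, and not the context-tree kernel itself, whose Bayesian averaging over context-trees is not reproduced here. The function names `spectrum` and `spectrum_kernel` and the parameter `k` are illustrative choices, not taken from the paper.

```python
from collections import Counter


def spectrum(s: str, k: int) -> Counter:
    """Count every contiguous length-k substring (k-mer) of s.

    Returns an empty Counter when s is shorter than k.
    """
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))


def spectrum_kernel(x: str, y: str, k: int) -> int:
    """Inner product of the k-mer count vectors of x and y.

    This is the plain spectrum kernel, shown only as a baseline;
    it is NOT the context-tree kernel described in the abstract.
    """
    sx, sy = spectrum(x, k), spectrum(y, k)
    # Iterate over the smaller table; a Counter returns 0 for missing keys.
    if len(sx) > len(sy):
        sx, sy = sy, sx
    return sum(count * sy[kmer] for kmer, count in sx.items())


if __name__ == "__main__":
    # Toy strings, chosen only for illustration.
    print(spectrum("ABAB", 2))
    print(spectrum_kernel("ABAB", "BABA", 2))
```

Each string is scanned once, so building the count table takes time linear in the string length for a fixed `k`. The tables could then be reused by any method that depends only on these counts.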

Publication types

Comparative Study
Research Support, N.I.H., Extramural
Research Support, Non-U.S. Gov't

MeSH terms

Algorithms*
Artificial Intelligence*
Bayes Theorem
Data Compression / methods*
Databases, Protein
Markov Chains
Neural Networks, Computer*
Sequence Analysis, Protein
Sequence Homology, Amino Acid

Grants and funding

R33 HG003070/HG/NHGRI NIH HHS/United States