Statistical evaluation of local alignment features predicting allergenicity using supervised classification algorithms

Int Arch Allergy Immunol. 2004 Feb;133(2):101-12. doi: 10.1159/000076382. Epub 2004 Jan 21.

Abstract

Background: Recently, two promising alignment-based features for predicting food allergenicity with the k-nearest neighbor (kNN) classifier were reported. These features are the alignment score and the alignment length of the best local alignment found against a database of known allergen sequences.
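The kNN idea described above can be sketched in a few lines. This is a minimal illustration only, not the published implementation: the function name `knn_predict`, the choice of Euclidean distance on unscaled features, the value of k, and every feature value below are assumptions made up for the example.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Majority vote among the k training points closest to the query.

    train: list of ((alignment_score, alignment_length), label) pairs
    query: (alignment_score, alignment_length) for the protein to classify
    """
    # Rank training points by Euclidean distance to the query (assumed metric)
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Made-up feature values for illustration only (not data from the study)
train = [
    ((320.0, 150.0), "allergen"),
    ((280.0, 120.0), "allergen"),
    ((300.0, 140.0), "allergen"),
    ((45.0, 30.0), "non-allergen"),
    ((60.0, 40.0), "non-allergen"),
    ((50.0, 25.0), "non-allergen"),
]

prediction = knn_predict(train, (290.0, 130.0), k=3)
```

In practice the two features live on different scales, so some form of standardization before computing distances would be a reasonable design choice; the abstract does not say how this was handled.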

Methods: In the work reported here, a much more comprehensive statistical evaluation of the potential of these features was performed, this time for the prediction of allergenicity in general. The evaluation had four key components: (1) a new high-quality database of 318 carefully selected, non-redundant allergens and 1,007 sequences carefully selected to be non-allergens; (2) three supervised algorithms: the kNN classifier, the Bayesian linear Gaussian classifier, and the Bayesian quadratic Gaussian classifier; (3) a large set of local alignment procedures defined with the FASTA3 alignment program over a wide range of parameter settings; and (4) novel performance curves, an alternative to conventional receiver-operating characteristic curves, that display not only average behavior but also the statistical variation caused by small data sets.

Results: The linear Gaussian classifier proved most useful among the tested supervised machine learning algorithms, closely followed by the quadratic Gaussian equivalent and kNN. The overall best classification results were obtained with a novel feature vector consisting of the combined alignment scores derived from local alignment procedures using different substitution matrices.
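A linear Gaussian classifier of the kind named above can be sketched as follows, assuming the standard textbook formulation: one Gaussian per class, a covariance matrix pooled across classes, and class priors estimated from class frequencies. The function names, the restriction to a two-component feature vector (alignment scores from two unnamed, hypothetical substitution matrices), and all numbers are assumptions for illustration, not details taken from the study.

```python
import math

def fit_linear_gaussian(samples):
    """Fit class means, the inverse pooled covariance, and class priors.

    samples: dict mapping label -> list of 2-D feature vectors, here
    (score_with_matrix_A, score_with_matrix_B); both matrices are hypothetical.
    """
    n_total = sum(len(xs) for xs in samples.values())
    means = {}
    scatter = [[0.0, 0.0], [0.0, 0.0]]
    for label, xs in samples.items():
        mu = [sum(x[i] for x in xs) / len(xs) for i in range(2)]
        means[label] = mu
        # Accumulate within-class scatter around the class mean
        for x in xs:
            d = [x[0] - mu[0], x[1] - mu[1]]
            for i in range(2):
                for j in range(2):
                    scatter[i][j] += d[i] * d[j]
    dof = n_total - len(samples)
    cov = [[scatter[i][j] / dof for j in range(2)] for i in range(2)]
    # Closed-form inverse of the 2x2 pooled covariance matrix
    det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
    inv = [[cov[1][1] / det, -cov[0][1] / det],
           [-cov[1][0] / det, cov[0][0] / det]]
    priors = {label: len(xs) / n_total for label, xs in samples.items()}
    return means, inv, priors

def predict(model, x):
    """Return the label with the largest linear discriminant score."""
    means, inv, priors = model

    def discriminant(label):
        mu = means[label]
        # w = inverse covariance times class mean
        w = [inv[0][0] * mu[0] + inv[0][1] * mu[1],
             inv[1][0] * mu[0] + inv[1][1] * mu[1]]
        return (w[0] * x[0] + w[1] * x[1]
                - 0.5 * (w[0] * mu[0] + w[1] * mu[1])
                + math.log(priors[label]))

    return max(means, key=discriminant)

# Made-up scores for illustration only (not data from the study)
samples = {
    "allergen": [(320.0, 290.0), (280.0, 300.0), (300.0, 270.0), (310.0, 310.0)],
    "non-allergen": [(45.0, 60.0), (60.0, 50.0), (50.0, 40.0), (40.0, 55.0)],
}
model = fit_linear_gaussian(samples)
label = predict(model, (290.0, 280.0))
```

The quadratic Gaussian variant differs only in estimating a separate covariance matrix per class, which makes the decision boundary quadratic rather than linear in the features.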

Conclusions: The models reported here should be useful as part of an integrated assessment scheme for potential protein allergenicity and as a baseline for future comparisons with alternative bioinformatic approaches.

Publication types

  • Evaluation Study
  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms*
  • Allergens / chemistry
  • Allergens / immunology*
  • Amino Acid Sequence
  • Computational Biology
  • Databases, Protein
  • Decision Trees
  • Food Hypersensitivity / prevention & control*
  • Food, Genetically Modified
  • Humans
  • Models, Immunological*
  • Sequence Alignment

Substances

  • Allergens