Allerdictor: fast allergen prediction using text classification techniques

Bioinformatics. 2014 Apr 15;30(8):1120-1128. doi: 10.1093/bioinformatics/btu004. Epub 2014 Jan 7.

Abstract

Motivation: Accurately identifying and eliminating allergens from biotechnology-derived products are important for human health. From a biomedical research perspective, it is also important to identify allergens in sequenced genomes. Many allergen prediction tools have been developed during the past years. Although these tools have achieved certain levels of specificity, when applied to large-scale allergen discovery (e.g. at a whole-genome scale), they still yield many false positives and thus low precision (even at low recall) due to the extreme skewness of the data (allergens are rare). Moreover, the most accurate tools are relatively slow because they use protein sequence alignment to build feature vectors for allergen classifiers. Additionally, only web server implementations of the current allergen prediction tools are publicly available and are without the capability of large batch submission. These weaknesses make large-scale allergen discovery ineffective and inefficient in the public domain.

Results: We developed Allerdictor, a fast and accurate sequence-based allergen prediction tool that models protein sequences as text documents and uses support vector machine in text classification for allergen prediction. Test results on multiple highly skewed datasets demonstrated that Allerdictor predicted allergens with high precision over high recall at fast speed. For example, Allerdictor only took ∼6 min on a single core PC to scan a whole Swiss-Prot database of ∼540 000 sequences and identified <1% of them as allergens.

Availability and implementation: Allerdictor is implemented in Python and available as standalone and web server versions at http://allerdictor.vbi.vt.edu CONTACT: lawrence@vbi.vt.edu Supplementary information: Supplementary data are available at Bioinformatics online.

MeSH terms

  • Allergens / analysis*
  • Amino Acid Sequence
  • Computational Biology
  • Databases, Protein
  • Humans
  • Proteins / analysis
  • Sequence Alignment
  • Software*
  • Support Vector Machine

Substances

  • Allergens
  • Proteins