All-paths graph kernel for protein-protein interaction extraction with evaluation of cross-corpus learning

Antti Airola; Sampo Pyysalo; Jari Björne; Tapio Pahikkala; Filip Ginter; Tapio Salakoski

doi:10.1186/1471-2105-9-S11-S2

BMC Bioinformatics. 2008 Nov 19;9 Suppl 11(Suppl 11):S2. doi: 10.1186/1471-2105-9-S11-S2.

Authors

Antti Airola¹, Sampo Pyysalo, Jari Björne, Tapio Pahikkala, Filip Ginter, Tapio Salakoski

Affiliation

¹ Turku Centre for Computer Science (TUCS) and the Department of IT, University of Turku, Joukahaisenkatu 3-5, 20520 Turku, Finland. antti.airola@utu.fi

Abstract

Background: Automated extraction of protein-protein interactions (PPI) is an important and widely studied task in biomedical text mining. We propose a graph-kernel-based approach for this task. In contrast to earlier approaches to PPI extraction, the introduced all-paths graph kernel can make use of full, general dependency graphs representing the sentence structure.

Results: We evaluate the proposed method on five publicly available PPI corpora, providing the most comprehensive evaluation conducted for a machine-learning-based PPI extraction system. We additionally perform a detailed evaluation of the effects of training and testing on different resources, providing insight into the challenges involved in applying a system beyond the data it was trained on. Our method is shown to achieve state-of-the-art performance with respect to comparable evaluations, with an F-score of 56.4 and an AUC of 84.8 on the AImed corpus.
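For readers less familiar with the two metrics reported above, the following is a minimal sketch of how F-score and AUC can be computed for binary interaction predictions. It is not the authors' evaluation code. The function names, labels, scores, and the 0.5 decision threshold are all illustrative assumptions, and the toy data has nothing to do with the corpora in the paper. The last part illustrates a general property of the metrics: adding more negative examples with the same score distribution changes the F-score but leaves AUC unchanged.

```python
def f_score(y_true, y_pred):
    """Balanced F-score (harmonic mean of precision and recall) for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc(y_true, scores):
    """Area under the ROC curve, computed as the fraction of
    (positive, negative) pairs that are ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (made up for illustration only).
labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.4, 0.6, 0.7, 0.2, 0.3]
THRESHOLD = 0.5  # assumed decision threshold
preds = [1 if s >= THRESHOLD else 0 for s in scores]

f_balanced = f_score(labels, preds)
auc_balanced = auc(labels, scores)

# Same classifier, but with every negative example repeated three times,
# mimicking a corpus with a larger share of non-interacting pairs.
neg_scores = [s for t, s in zip(labels, scores) if t == 0]
labels_skewed = labels + [0] * (2 * len(neg_scores))
scores_skewed = scores + neg_scores * 2
preds_skewed = [1 if s >= THRESHOLD else 0 for s in scores_skewed]

f_skewed = f_score(labels_skewed, preds_skewed)
auc_skewed = auc(labels_skewed, scores_skewed)

print(f"balanced: F={f_balanced:.3f} AUC={auc_balanced:.3f}")
print(f"skewed:   F={f_skewed:.3f} AUC={auc_skewed:.3f}")
```

In this toy setting the F-score drops when negatives become more frequent while AUC stays the same, which is one general reason F-scores obtained on corpora with different class distributions are hard to compare directly.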

Conclusion: We show that the graph kernel approach performs at a state-of-the-art level in PPI extraction, and note the possible extension to the task of extracting complex interactions. Cross-corpus results provide further insight into how the learning generalizes beyond individual corpora. Further, we identify several pitfalls that can make evaluations of PPI-extraction systems incomparable, or even invalid. These include incorrect cross-validation strategies and problems related to comparing F-score results achieved on different evaluation resources. Recommendations for avoiding these pitfalls are provided.
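The abstract does not spell out which cross-validation strategies are incorrect. One commonly discussed issue in this setting is leakage: if candidate protein pairs from the same document are split between training and test folds, the test data is no longer independent of the training data. The sketch below shows one way to keep all instances from a document in the same fold. The function name, the document identifiers, and the round-robin fold assignment are assumptions made for illustration, not the procedure used in the paper.

```python
def document_level_folds(doc_ids, n_folds):
    """Assign a fold index to each instance so that all instances
    sharing a document id end up in the same fold.

    doc_ids: one document identifier per candidate pair (hypothetical input).
    n_folds: number of cross-validation folds.
    """
    if n_folds < 1:
        raise ValueError("n_folds must be at least 1")
    unique_docs = sorted(set(doc_ids))
    # Round-robin assignment of documents (not instances) to folds.
    fold_of_doc = {doc: i % n_folds for i, doc in enumerate(unique_docs)}
    return [fold_of_doc[doc] for doc in doc_ids]


# Seven candidate pairs drawn from four made-up documents.
example_doc_ids = ["d1", "d1", "d2", "d3", "d3", "d3", "d4"]
example_folds = document_level_folds(example_doc_ids, n_folds=2)

for fold in range(2):
    test_idx = [i for i, f in enumerate(example_folds) if f == fold]
    train_idx = [i for i, f in enumerate(example_folds) if f != fold]
    test_docs = {example_doc_ids[i] for i in test_idx}
    train_docs = {example_doc_ids[i] for i in train_idx}
    # No document contributes instances to both sides of the split.
    print(fold, sorted(train_docs), sorted(test_docs))
```

Splitting on documents rather than on individual pairs usually produces folds of slightly uneven size, which is the price paid for keeping the training and test sets free of shared documents.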

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms
Artificial Intelligence
Computational Biology / methods*
Databases as Topic
Natural Language Processing
Protein Interaction Mapping / methods*