Genome-wide inference of protein interaction sites: lessons from the yeast high-quality negative protein-protein interaction dataset

Nucleic Acids Res. 2008 Apr;36(6):2002-11. doi: 10.1093/nar/gkn016. Epub 2008 Feb 14.

Abstract

High-throughput experimental and computational studies have produced what may be the most comprehensive protein-protein interaction datasets for completely sequenced genomes. These datasets provide an opportunity to discover the underlying protein interaction patterns on a proteome scale. Here, we propose an approach to discovering motif pairs at interaction sites (often 3-8 residues long), which are essential for understanding protein function and helpful for the rational design of protein engineering and folding experiments. A gold standard positive (interacting) dataset and a gold standard negative (non-interacting) dataset were mined to infer interacting motif pairs that are significantly overrepresented in the positive dataset relative to the negative dataset. Four negative datasets assembled by different strategies were evaluated, and the one with the best performance was used as the gold standard negative set for further analysis. To assess how well our method detects potential interacting motif pairs, we compared it with previously developed approaches and found that it achieved the highest prediction accuracy. In addition, many previously uncharacterized motif pairs of interest were found to be functional, supported by experimental evidence in other species. This investigation demonstrates how strongly the quality of the negative dataset affects the performance of such statistical inference.
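
The idea of scoring motif pairs by their overrepresentation in positives relative to negatives can be illustrated with a minimal sketch. The abstract does not specify the statistic used, so everything below is an assumption made for illustration: motifs are treated as fixed-length k-mers, each motif pair is counted once per protein pair, and the score is a pseudocount-smoothed log-odds ratio. The function names (`kmers`, `motif_pair_counts`, `log_odds`) and the parameters `k` and `pseudocount` are hypothetical and are not taken from the paper.

```python
from collections import Counter
from itertools import product
from math import log2


def kmers(seq, k):
    """Return the set of distinct length-k substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def motif_pair_counts(protein_pairs, k):
    """Count how many protein pairs contain each unordered motif pair.

    A motif pair is present when one motif occurs in the first protein
    and the other occurs in the second. Each protein pair contributes
    at most one count per motif pair.
    """
    counts = Counter()
    for seq_a, seq_b in protein_pairs:
        seen = set()
        for m1, m2 in product(kmers(seq_a, k), kmers(seq_b, k)):
            seen.add(tuple(sorted((m1, m2))))
        counts.update(seen)
    return counts


def log_odds(positives, negatives, k=3, pseudocount=1.0):
    """Score each motif pair seen in the positives (illustrative statistic).

    The score is log2 of the smoothed frequency in the positive set
    divided by the smoothed frequency in the negative set. The smoothing
    is an assumption of this sketch, not the paper's method.
    """
    pos = motif_pair_counts(positives, k)
    neg = motif_pair_counts(negatives, k)
    n_pos, n_neg = len(positives), len(negatives)
    scores = {}
    for pair, count in pos.items():
        p = (count + pseudocount) / (n_pos + 2 * pseudocount)
        q = (neg[pair] + pseudocount) / (n_neg + 2 * pseudocount)
        scores[pair] = log2(p / q)
    return scores
```

In this sketch, a motif pair that never appears in the negative set receives a high score only because of the pseudocount, so the ranking is only as reliable as the negative set itself. That is consistent with the abstract's emphasis on a high-quality negative dataset, although a real analysis would also need a significance test and a correction for multiple comparisons.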

Publication types

  • Evaluation Study
  • Research Support, Non-U.S. Gov't

MeSH terms

  • Data Interpretation, Statistical
  • Databases, Protein
  • Genomics*
  • Protein Interaction Domains and Motifs*
  • Protein Interaction Mapping* / standards
  • Reference Standards
  • Yeasts / genetics*