Is newer better?--evaluating the effects of data curation on integrated analyses in Saccharomyces cerevisiae

Integr Biol (Camb). 2012 Jul;4(7):715-27. doi: 10.1039/c2ib00123c. Epub 2012 Apr 23.

Abstract

Recent high-throughput experiments have produced a wealth of heterogeneous datasets, each of which provides information about different aspects of the cell. Consequently, integration of diverse data types is essential in order to address many biological questions. The quality of any integrated analysis system depends on the quality of its component data and on the Gold Standard data used to evaluate it. It is commonly assumed that the quality of data improves as databases grow and change, particularly for manually curated databases. However, this assumption is open to question, given the constant changes in the data coupled with the high level of noise associated with high-throughput experimental techniques. One of the most powerful approaches to data integration is the use of Probabilistic Functional Integrated Networks (PFINs). Here, we systematically analyse the changes in four highly curated and widely used online databases and evaluate the extent to which these changes affect the protein function prediction performance of PFINs in the yeast Saccharomyces cerevisiae. We find that, as a global trend, network performance improves over time. Where individual areas of biology are concerned, however, the most recent files do not always produce the best results. Individual datasets have unique biases towards different biological processes, and performance can be improved by selecting and integrating relevant datasets. When using any type of integrated system to answer a specific biological question, careful selection of raw data and Gold Standard is vital, since the most recent data may not be the most appropriate.
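
The abstract does not say how datasets are scored against the Gold Standard. As a minimal sketch only, assuming a log-likelihood score of the kind often used when building PFINs, the comparison of an older and a newer release of one dataset might look like the following. The function names, gene identifiers (g1 to g6), and both toy releases are hypothetical and are not taken from the paper.

```python
import math


def pair(a, b):
    """Unordered gene pair, so that (a, b) and (b, a) compare equal."""
    return frozenset((a, b))


def log_likelihood_score(edges, gold_pos, gold_neg):
    """Assumed scoring scheme: log of (odds of a Gold Standard link among
    the dataset's pairs) divided by (prior odds in the Gold Standard)."""
    pos_in = len(edges & gold_pos)  # dataset pairs confirmed by the Gold Standard
    neg_in = len(edges & gold_neg)  # dataset pairs contradicted by it
    if pos_in == 0 or neg_in == 0:
        raise ValueError("dataset must overlap both Gold Standard classes")
    posterior_odds = pos_in / neg_in
    prior_odds = len(gold_pos) / len(gold_neg)
    return math.log(posterior_odds / prior_odds)


# Hypothetical Gold Standard: pairs known to share, or not share, a process.
gold_pos = {pair("g1", "g2"), pair("g3", "g4")}
gold_neg = {pair("g1", "g3"), pair("g2", "g4"),
            pair("g5", "g6"), pair("g1", "g5")}

# Hypothetical releases of one dataset: the newer one only adds pairs.
old_release = {pair("g1", "g2"), pair("g3", "g4"), pair("g1", "g3")}
new_release = old_release | {pair("g2", "g4"), pair("g5", "g6")}

old_score = log_likelihood_score(old_release, gold_pos, gold_neg)
new_score = log_likelihood_score(new_release, gold_pos, gold_neg)
print(f"old release: {old_score:.3f}, new release: {new_score:.3f}")
```

In this toy case the newer release is larger but scores lower, because every pair it adds falls in the negative class of the Gold Standard. The same release could score differently against another Gold Standard, which is why the choice of Gold Standard matters as much as the choice of raw data.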

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Area Under Curve
  • Computational Biology / methods*
  • DNA Repair
  • Data Interpretation, Statistical
  • Databases, Protein
  • False Positive Reactions
  • Models, Statistical
  • Protein Interaction Mapping / methods
  • Reproducibility of Results
  • Saccharomyces cerevisiae / genetics*
  • Saccharomyces cerevisiae / physiology*
  • Saccharomyces cerevisiae Proteins / metabolism*
  • Sensitivity and Specificity

Substances

  • Saccharomyces cerevisiae Proteins