Gene functional annotation by statistical analysis of biomedical articles

Int J Med Inform. 2007 Aug;76(8):601-13. doi: 10.1016/j.ijmedinf.2006.04.011. Epub 2006 Jun 14.

Abstract

Background: Functional annotation of genes is an important task in biology because it facilitates the characterization of gene relationships and the understanding of biochemical pathways. Gene functions can be described by standardized, structured vocabularies called bio-ontologies. Bio-ontology terms are assigned to genes by applying data mining and machine learning methods, such as maximum entropy models and support vector machines (SVM), to datasets extracted from biomedical articles.

Purpose: The aim of this paper is to propose an alternative to the existing methods for functionally annotating genes. The methodology involves building classification models, validating the results and representing them graphically, and reducing the dimensionality of the dataset.

Methods: Classification models are constructed by linear discriminant analysis (LDA). The models are validated through statistical analysis and interpretation of the results, using techniques such as hold-out samples and test datasets and metrics such as the confusion matrix, accuracy, recall, precision and F-measure. Graphical representations of the variables resulting from the classification models, such as boxplots, Andrews curves and scatterplots, are also used for validating and interpreting the results.
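
As a rough illustration of the validation workflow described in the Methods, the following minimal Python sketch fits a two-class LDA rule on a toy training set and scores it on a hold-out sample with a confusion matrix, accuracy, precision, recall and F-measure. Everything in it is an assumption made for illustration: the two-feature data points are invented, the helper names (fit_lda_2d, confusion, scores) are hypothetical, equal class priors are assumed, and nothing here reproduces the paper's dataset, features or implementation.

```python
# Toy two-class LDA with hold-out validation.
# All data points and helper names are hypothetical, for illustration only.

def mean_vec(rows):
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]

def fit_lda_2d(class0, class1):
    """Return (w, threshold) for two-feature LDA with a pooled scatter matrix."""
    m0, m1 = mean_vec(class0), mean_vec(class1)
    s = [[0.0, 0.0], [0.0, 0.0]]  # pooled within-class scatter (2x2)
    for rows, m in ((class0, m0), (class1, m1)):
        for r in rows:
            d = [r[0] - m[0], r[1] - m[1]]
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j]
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    inv = [[s[1][1] / det, -s[0][1] / det],
           [-s[1][0] / det, s[0][0] / det]]
    diff = [m1[0] - m0[0], m1[1] - m0[1]]
    # Discriminant direction: inverse scatter times the difference of class means.
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    # Threshold at the projected midpoint of the means (assumes equal priors).
    mid = [(m0[0] + m1[0]) / 2, (m0[1] + m1[1]) / 2]
    return w, w[0] * mid[0] + w[1] * mid[1]

def predict(w, threshold, x):
    return 1 if w[0] * x[0] + w[1] * x[1] > threshold else 0

def confusion(y_true, y_pred):
    """Return (tp, fp, fn, tn), treating label 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def scores(tp, fp, fn, tn):
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_measure": f_measure}

# Hypothetical training data: class 1 = "term applies", class 0 = "term does not apply".
train0 = [(1.0, 2.0), (2.0, 1.0), (1.5, 1.5), (2.5, 2.5)]
train1 = [(5.0, 6.0), (6.0, 5.0), (5.5, 5.0), (6.5, 6.5)]
w, threshold = fit_lda_2d(train0, train1)

# Hypothetical hold-out sample, never seen during fitting.
holdout_x = [(1.0, 1.0), (2.0, 3.0), (3.0, 3.0), (4.0, 4.0),
             (4.5, 4.5), (5.0, 5.0), (6.0, 6.0)]
holdout_y = [0, 0, 0, 0, 1, 1, 1]
y_pred = [predict(w, threshold, x) for x in holdout_x]

tp, fp, fn, tn = confusion(holdout_y, y_pred)
result = scores(tp, fp, fn, tn)
print((tp, fp, fn, tn), result)
```

In a multi-term setting such as the one described here, one such binary model per ontology term could be evaluated in the same way and the per-term F-measures averaged; that averaging scheme is likewise an assumption, not a detail given in the abstract.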

Results: The proposed methodology was applied to a dataset extracted from biomedical articles for 12 Gene Ontology terms. Validation of the LDA models and comparison with the SVM show that LDA (mean F-measure 75.4%) outperforms the SVM (mean F-measure 68.7%) on this dataset.

Conclusion: The application of certain statistical methods can be beneficial for functional gene annotation from biomedical articles. Apart from performing well, the results can be interpreted and give insight into the structure of the bio-text data.

MeSH terms

  • Abstracting and Indexing*
  • Artificial Intelligence
  • Biomedical Research*
  • Databases, Bibliographic*
  • Discriminant Analysis*
  • Genes / physiology*
  • Peer Review, Research*
  • Reproducibility of Results