Sparse Markov chain-based semi-supervised multi-instance multi-label method for protein function prediction

J Bioinform Comput Biol. 2015 Oct;13(5):1543001. doi: 10.1142/S0219720015430015. Epub 2015 Sep 16.

Abstract

Automated assignment of protein function has received considerable attention in recent years in genome-wide studies. With the rapid accumulation of genome sequencing data produced by high-throughput experimental techniques, manual annotation of the functional properties of proteins has become increasingly cumbersome. Such large genomics data sets can only be annotated computationally. However, automatically assigning functions to unknown proteins remains challenging because of the inherent complexity of the problem. Previous studies have shown that the multi-instance multi-label (MIML) framework is effective for problems involving complicated objects with multiple semantic meanings. In protein function prediction, each protein may be associated with several distinct structural units and multiple functional properties; each structural unit is described by an instance and each functional property is treated as a class label. The MIML framework is therefore a natural fit for the protein function prediction problem. In this paper, we propose a sparse Markov chain-based semi-supervised MIML method, called Sparse-Markov. A sparse transductive probability graph is constructed to encode the affinity information of the data based on an ensemble of Hausdorff distance metrics. Our goal is to exploit the affinity between protein objects in this graph to seek a sparse steady-state probability of the Markov chain model for protein function prediction, such that two proteins are given similar functional labels if they are close to each other in terms of the ensemble Hausdorff distance in the graph. Experimental results on seven real-world organism data sets covering three biological domains show that the proposed Sparse-Markov method achieves better performance than four state-of-the-art MIML learning algorithms.
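The pipeline described in the abstract (bag-level Hausdorff distances, a sparse affinity graph, and a Markov chain steady state used to score labels) can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything specific here is an assumption made for illustration: the ensemble is taken as the plain average of the maximal, minimal and average Hausdorff distances, the graph uses a Gaussian kernel with k-nearest-neighbour sparsification, and the steady state is computed with an ordinary random walk with restart rather than the sparsity-seeking solver of the paper. All function names and the toy data are hypothetical.

```python
import numpy as np

def hausdorff_ensemble(A, B):
    """Assumed ensemble: mean of max, min and average Hausdorff distances."""
    # Pairwise Euclidean distances between instances, shape (len(A), len(B)).
    D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    a_to_b = D.min(axis=1)  # distance from each instance of A to its nearest in B
    b_to_a = D.min(axis=0)  # distance from each instance of B to its nearest in A
    d_max = max(a_to_b.max(), b_to_a.max())
    d_min = D.min()
    d_avg = (a_to_b.sum() + b_to_a.sum()) / (len(A) + len(B))
    return (d_max + d_min + d_avg) / 3.0

def transition_matrix(bags, k=2):
    """Column-stochastic transition matrix of a sparse k-NN affinity graph."""
    n = len(bags)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = hausdorff_ensemble(bags[i], bags[j])
    sigma = dist[dist > 0].mean()          # assumed kernel width heuristic
    W = np.exp(-(dist / sigma) ** 2)
    np.fill_diagonal(W, 0.0)
    # Keep only the k strongest affinities per row, then symmetrize the mask.
    keep = np.zeros((n, n), dtype=bool)
    idx = np.argsort(-W, axis=1)[:, :k]
    keep[np.arange(n)[:, None], idx] = True
    W = np.where(keep | keep.T, W, 0.0)
    return W / W.sum(axis=0, keepdims=True)

def propagate(P, Y, labeled, alpha=0.2, tol=1e-10, max_iter=1000):
    """Random walk with restart per label; returns steady-state scores (n, q)."""
    scores = np.zeros(Y.shape, dtype=float)
    for c in range(Y.shape[1]):
        seed = np.where(labeled, Y[:, c], 0.0).astype(float)
        if seed.sum() == 0:
            continue                       # no labeled example for this class
        seed /= seed.sum()
        p = seed.copy()
        for _ in range(max_iter):
            p_new = (1 - alpha) * (P @ p) + alpha * seed
            converged = np.abs(p_new - p).sum() < tol
            p = p_new
            if converged:
                break
        scores[:, c] = p
    return scores

# Toy data: six "proteins" (bags of 2-D instances) in two well-separated groups.
bags = [
    np.array([[0.0, 0.0], [1.0, 0.0]]),
    np.array([[0.0, 1.0]]),
    np.array([[1.0, 1.0], [0.5, 0.5]]),
    np.array([[10.0, 10.0], [11.0, 10.0]]),
    np.array([[10.0, 11.0]]),
    np.array([[11.0, 11.0], [10.5, 10.5]]),
]
Y = np.zeros((6, 2))
Y[0, 0] = 1                                # bag 0 carries label 0
Y[3, 1] = 1                                # bag 3 carries label 1
labeled = np.array([True, False, False, True, False, False])

P = transition_matrix(bags, k=2)
scores = propagate(P, Y, labeled)
predicted = scores.argmax(axis=1)          # highest-scoring label per bag
```

In this toy setting the unlabeled bags inherit the label of the labeled bag in their own group, which mirrors the stated goal that proteins close in ensemble Hausdorff distance receive similar labels. A multi-label prediction would threshold each column of `scores` rather than take the argmax; the threshold choice is left open here.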

Keywords: Hausdorff distance; Markov chain; Protein function prediction; multi-instance multi-label learning; semi-supervised learning.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Animals
  • Computational Biology
  • Databases, Protein / statistics & numerical data
  • Genome-Wide Association Study / statistics & numerical data
  • Markov Chains*
  • Proteins / chemistry*
  • Proteins / genetics
  • Proteins / physiology*
  • Supervised Machine Learning*

Substances

  • Proteins