Multi-instance multi-label distance metric learning for genome-wide protein function prediction

Yonghui Xu; Huaqing Min; Hengjie Song; Qingyao Wu

doi:10.1016/j.compbiolchem.2016.02.011

Comput Biol Chem. 2016 Aug:63:30-40. doi: 10.1016/j.compbiolchem.2016.02.011. Epub 2016 Feb 13.

Authors

Yonghui Xu¹, Huaqing Min², Hengjie Song³, Qingyao Wu⁴

Affiliations

¹ School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, China.
² School of Software Engineering, South China University of Technology, Guangzhou 510006, China. Electronic address: hqmin@scut.edu.cn.
³ School of Software Engineering, South China University of Technology, Guangzhou 510006, China.
⁴ School of Software Engineering, South China University of Technology, Guangzhou 510006, China. Electronic address: qyw@scut.edu.cn.

PMID: 26923212
DOI: 10.1016/j.compbiolchem.2016.02.011

Abstract

Multi-instance multi-label (MIML) learning has proven effective for genome-wide protein function prediction, where each training example is associated with multiple instances and multiple class labels. Many existing MIML methods for this problem optimize objective functions in which dissimilarity between instances is measured with the Euclidean distance. In many real applications, however, the Euclidean distance may fail to capture the intrinsic similarity and dissimilarity in feature space and label space. In this paper, we instead propose a multi-instance multi-label distance metric learning framework (MIMLDML) for genome-wide protein function prediction. Specifically, we learn a Mahalanobis distance that preserves and exploits the intrinsic geometric information of both feature space and label space for MIML learning. In addition, we address sparsely labeled data by assigning weights to the labeled data. Extensive experiments on seven real-world organisms covering the biological three-domain system (i.e., archaea, bacteria, and eukaryote; Woese et al., 1990) show that the MIMLDML algorithm is superior to most state-of-the-art MIML learning algorithms.
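
The abstract does not give the MIMLDML objective, so the following is only a minimal sketch of the general idea it names: a Mahalanobis distance d(x, y) = sqrt((x - y)^T M (x - y)) reduces to the Euclidean distance when M is the identity and reweights feature directions otherwise. The matrices below are hand-picked for illustration rather than learned, and the `bag_distance` helper (average pairwise instance distance between two bags) is a hypothetical aggregation assumed here, not the method from the paper.

```python
import math

def mahalanobis(x, y, M):
    """Return sqrt((x - y)^T M (x - y)) for a positive semidefinite matrix M."""
    d = [a - b for a, b in zip(x, y)]
    # Matrix-vector product M @ d, written out with plain lists.
    Md = [sum(m_ij * d_j for m_ij, d_j in zip(row, d)) for row in M]
    return math.sqrt(sum(d_i * md_i for d_i, md_i in zip(d, Md)))

def bag_distance(bag_a, bag_b, M):
    """Average pairwise instance distance between two bags.

    Assumption: this aggregation is illustrative only; the abstract does not
    say how MIMLDML compares bags of instances.
    """
    total = sum(mahalanobis(x, y, M) for x in bag_a for y in bag_b)
    return total / (len(bag_a) * len(bag_b))

identity = [[1.0, 0.0], [0.0, 1.0]]    # M = I recovers the Euclidean distance
stretched = [[4.0, 0.0], [0.0, 1.0]]   # hand-picked M that emphasizes feature 0

euclid = mahalanobis([0.0, 0.0], [3.0, 4.0], identity)
weighted = mahalanobis([0.0, 0.0], [3.0, 4.0], stretched)
bag_d = bag_distance([[0.0, 0.0]], [[3.0, 4.0], [0.0, 0.0]], identity)
```

With the identity matrix the two points are 5 apart, as in the Euclidean case; the stretched matrix weights differences in the first feature more heavily and so yields a larger distance. A metric learning method would fit M from data instead of fixing it by hand.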

Keywords: Distance metric learning; Genome wide; Machine learning; Multi-instance multi-label learning; Protein function prediction.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms
Genome*
Machine Learning*
Proteins / genetics*

Substances

Proteins