C-PUGP: A cluster-based positive unlabeled learning method for disease gene prediction and prioritization

Comput Biol Chem. 2018 Oct:76:23-31. doi: 10.1016/j.compbiolchem.2018.05.022. Epub 2018 Jun 1.

Abstract

Disease gene detection is an important stage in understanding disease processes and treatment. Many machine learning methods have been used to identify candidate disease genes. Although these methods differ in the feature vectors used to represent genes, the method used to select reliable negative data (non-disease genes), and the classification method, the lack of negative data is the most significant challenge they share. Recently, candidate disease genes have been identified by semi-supervised learning methods based on positive and unlabeled data. These methods are reasonably accurate and have achieved better results than preceding methods. In this article, we propose a novel Positive Unlabeled (PU) learning technique based on clustering and a One-Class classification algorithm. Unlike existing methods, we build a more Reliable Negative (RN) set in three steps: (1) clustering the positive data, (2) learning One-Class classifier models from the clusters, and (3) selecting the intersection set of negative data as the Reliable Negative set. Next, we identify and rank the candidate disease genes using a binary classifier based on the support vector machine (SVM) algorithm. Experimental results indicate that the proposed method yields the best results: 92.8, 93.6, and 93.1 in terms of precision, recall, and F-measure, respectively. Compared to existing methods, the proposed method improves on the best existing method by 11.7 percent in terms of F-measure. The results also show about a 6% increase in prioritization performance.
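The three-step construction of the Reliable Negative set can be illustrated with a minimal sketch. The abstract does not name the clustering algorithm or the One-Class classifier, so everything specific here is an assumption: plain k-means stands in for the clustering step, and a simple distance-to-centroid threshold stands in for the One-Class classifier. The function names (`kmeans`, `fit_one_class`, `is_rejected`, `reliable_negatives`), the parameter `k`, and the toy 2-D points are hypothetical and chosen only for illustration. Step (3) is read here as keeping the unlabeled points that every per-cluster model rejects.

```python
import math
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means (assumed clustering step); returns non-empty clusters."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid; keep the old one if its cluster is empty.
        centroids = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return [cl for cl in clusters if cl]


def fit_one_class(cluster, slack=1.0):
    """Stand-in one-class model: a centroid plus an acceptance radius."""
    centroid = tuple(sum(c) / len(cluster) for c in zip(*cluster))
    radius = slack * max(math.dist(p, centroid) for p in cluster)
    return centroid, radius


def is_rejected(model, x):
    """True if x lies outside the model's acceptance radius."""
    centroid, radius = model
    return math.dist(x, centroid) > radius


def reliable_negatives(positives, unlabeled, k=2):
    # Step 1: cluster the positive data.
    clusters = kmeans(positives, k)
    # Step 2: learn one one-class model per cluster.
    models = [fit_one_class(cl) for cl in clusters]
    # Step 3: keep only points rejected by every model (the intersection).
    return [u for u in unlabeled if all(is_rejected(m, u) for m in models)]


# Toy data (hypothetical): two tight groups of positives in 2-D.
positives = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0),
]
# Unlabeled points: two resemble the positives, two do not.
unlabeled = [(0.4, 0.6), (10.6, 10.4), (5.0, 5.0), (20.0, -5.0)]

rn = reliable_negatives(positives, unlabeled, k=2)
print(rn)
```

In the pipeline the abstract describes, the positives and the resulting RN set would then train a binary SVM whose scores rank the remaining candidate genes; that stage is omitted from this sketch.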

Keywords: Candidate disease genes; Classification; Clustering; Identification; PUL; Semi-supervised learning.

MeSH terms

  • Disease / genetics*
  • Genes / genetics*
  • Genetic Predisposition to Disease / genetics*
  • Genomics / methods*
  • Machine Learning*
  • Mutation
  • Principal Component Analysis