A novel method based on physicochemical properties of amino acids and one class classification algorithm for disease gene identification

J Biomed Inform. 2015 Aug:56:300-6. doi: 10.1016/j.jbi.2015.06.018. Epub 2015 Jul 2.

Abstract

Identifying the genes that cause disease is one of the most challenging issues in establishing diagnosis and treatment quickly. Several interesting methods for disease gene identification have been introduced over the past decade. In general, the main differences between these methods are the type of data used as prior knowledge and the machine learning (ML) methods used for identification. ML methods have commonly treated the disease gene identification task as a binary classification problem (whether a gene is a disease gene or not). However, the nature of the data creates a major problem that affects the results: no negative data are available for training the learners. In this paper, a sequence-based, one-class classification method is introduced to assign genes to the disease class (yes, no). First, to generate the feature vectors, the sequences of proteins (genes) are transformed into numerical vectors using the physicochemical properties of amino acids. Second, as there is no definite approach to defining non-disease genes (negative data), we have attempted to model solely disease genes (positive data) and make predictions by employing the Support Vector Data Description algorithm. The experimental results confirm the efficiency of the method, with precision, recall and F-measure of 79.3%, 82.6% and 80.9%, respectively.
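
The two steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and several parts are assumptions: the abstract does not say which physicochemical properties were used, so the sketch uses the Kyte-Doolittle hydropathy scale plus the fractions of charged and aromatic residues; the toy sequences are hypothetical; and a plain hypersphere (centroid plus the largest training distance) stands in for Support Vector Data Description, which instead solves an optimization problem, usually with a kernel. The final function recomputes the F-measure from the reported precision and recall.

```python
from math import dist

# Kyte-Doolittle hydropathy index per amino acid (assumed property choice;
# the abstract does not list the properties the paper actually uses).
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
CHARGED = set("DEKR")
AROMATIC = set("FWY")


def encode(seq):
    """Turn a protein sequence into a fixed-length numeric feature vector."""
    residues = [r for r in seq.upper() if r in HYDROPATHY]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    n = len(residues)
    return [
        sum(HYDROPATHY[r] for r in residues) / n,  # mean hydropathy
        sum(r in CHARGED for r in residues) / n,   # fraction charged
        sum(r in AROMATIC for r in residues) / n,  # fraction aromatic
    ]


class SphereModel:
    """Simplified one-class model: a hypersphere around the positives.

    Stand-in for SVDD (assumption): the center is the centroid and the
    radius is the largest distance from it to any training vector.
    """

    def fit(self, vectors):
        dim = len(vectors[0])
        self.center = [sum(v[i] for v in vectors) / len(vectors)
                       for i in range(dim)]
        self.radius = max(dist(v, self.center) for v in vectors)
        return self

    def predict(self, vector):
        """Return True if the vector falls inside the learned sphere."""
        return dist(vector, self.center) <= self.radius


def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy "disease gene" sequences (positives only, no negatives).
positives = ["MKTLLVAAGL", "MKALIVLGLV", "MRTLAVAAGI"]
model = SphereModel().fit([encode(s) for s in positives])

print(model.predict(encode("MKTLLVAAGL")))  # a training positive
print(model.predict(encode("DDDDEEEEKK")))  # a very different sequence
print(round(f_measure(0.793, 0.826), 3))    # 0.809, i.e. the reported 80.9%
```

Only positive examples are needed to fit the model, which is the point of the one-class formulation: a new gene is assigned to the disease class when its feature vector lies inside the region learned from known disease genes.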

Keywords: Disease gene identification; One class classification; Physicochemical properties of amino acid; Support Vector Data Description.

MeSH terms

  • Algorithms
  • Amino Acid Sequence
  • Amino Acids / chemistry*
  • Artificial Intelligence*
  • Computational Biology / methods*
  • Diagnosis, Computer-Assisted / methods
  • False Positive Reactions
  • Humans
  • Parkinson Disease / metabolism
  • Principal Component Analysis
  • Probability
  • Proteins / chemistry*
  • ROC Curve
  • Regression Analysis
  • Reproducibility of Results
  • Support Vector Machine

Substances

  • Amino Acids
  • Proteins