Mining features for biomedical data using clustering tree ensembles

J Biomed Inform. 2018 Sep:85:40-48. doi: 10.1016/j.jbi.2018.07.012. Epub 2018 Jul 29.

Abstract

The volume of biomedical data available to the machine learning community is growing rapidly. A reasonable question is how informative these data really are, or how discriminative the features describing the data instances are. Several biomedical datasets suffer from a lack of variance in the instance representation or, even worse, contain instances with identical features and different class labels. This directly affects the performance of machine learning algorithms, as well as the ability to interpret their results. In this article, we focus on this problem and propose a target-informed feature induction method based on tree ensemble learning. The method brings more variance into the data representation, thereby potentially increasing the predictive performance of a learner applied to the induced features. The contribution of this article is twofold: first, a problem affecting the quality of biomedical data is highlighted; second, a method to handle that problem is proposed. The effectiveness of the presented approach is validated on multi-target prediction tasks. The results indicate that the proposed approach can improve the discrimination between data instances and increase predictive performance.
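
The general idea of target-informed feature induction with a tree ensemble can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes scikit-learn, uses ExtraTreesRegressor as a stand-in for the clustering tree ensemble, and represents each instance by the leaves it reaches in each tree (a tree embedding). The function name induce_tree_features, the hyperparameter values, and the synthetic data are hypothetical and chosen only for illustration.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.preprocessing import OneHotEncoder


def induce_tree_features(X, Y, n_trees=50, max_depth=4, seed=0):
    """Fit a multi-target extra-trees ensemble and encode leaf membership.

    Illustrative sketch only; hyperparameters are arbitrary assumptions.
    """
    forest = ExtraTreesRegressor(
        n_estimators=n_trees, max_depth=max_depth, random_state=seed
    )
    # Splits are chosen using all target columns, so the partition
    # of the input space is informed by the targets.
    forest.fit(X, Y)
    # leaves[i, t] is the index of the leaf that instance i reaches in tree t.
    leaves = forest.apply(X)
    # One binary feature per (tree, leaf) pair.
    encoder = OneHotEncoder(handle_unknown="ignore")
    Z = encoder.fit_transform(leaves)
    return forest, encoder, leaves, Z


# Synthetic stand-in data: 6 binary input features, 2 numeric targets.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 6)).astype(float)
Y = np.column_stack([X[:, 0] + X[:, 1], X[:, 2] * X[:, 3]])
Y = Y + rng.normal(scale=0.1, size=Y.shape)

forest, encoder, leaves, Z = induce_tree_features(X, Y)
print("original features:", X.shape[1], "induced features:", Z.shape[1])
```

The induced matrix Z is sparse and can be passed to any downstream learner, alone or concatenated with X. New instances can be mapped into the same space with encoder.transform(forest.apply(X_new)).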

Keywords: Biomedical data mining; Extremely randomized trees; Tree-embeddings; Tree-ensembles.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Cluster Analysis*
  • Computational Biology
  • Data Mining / methods*
  • Databases, Factual / statistics & numerical data
  • Decision Trees*
  • Escherichia coli / genetics
  • Escherichia coli / metabolism
  • Gene Regulatory Networks
  • Humans
  • Machine Learning*
  • Metabolic Networks and Pathways
  • Protein Interaction Maps
  • Saccharomyces cerevisiae / genetics
  • Saccharomyces cerevisiae / metabolism