CamurWeb: a classification software and a large knowledge base for gene expression data of cancer

BMC Bioinformatics. 2018 Oct 15;19(Suppl 10):354. doi: 10.1186/s12859-018-2299-7.

Abstract

Background: The high growth of Next Generation Sequencing data currently demands new knowledge extraction methods. In particular, the RNA sequencing gene expression experimental technique stands out for case-control studies on cancer, which can be addressed with supervised machine learning techniques able to extract human interpretable models composed of genes, and their relation to the investigated disease. State of the art rule-based classifiers are designed to extract a single classification model, possibly composed of few relevant genes. Conversely, we aim to create a large knowledge base composed of many rule-based models, and thus determine which genes could be potentially involved in the analyzed tumor. This comprehensive and open access knowledge base is required to disseminate novel insights about cancer.

Results: We propose CamurWeb, a new method and web-based software that is able to extract multiple and equivalent classification models in form of logic formulas ("if then" rules) and to create a knowledge base of these rules that can be queried and analyzed. The method is based on an iterative classification procedure and an adaptive feature elimination technique that enables the computation of many rule-based models related to the cancer under study. Additionally, CamurWeb includes a user friendly interface for running the software, querying the results, and managing the performed experiments. The user can create her profile, upload her gene expression data, run the classification analyses, and interpret the results with predefined queries. In order to validate the software we apply it to all public available RNA sequencing datasets from The Cancer Genome Atlas database obtaining a large open access knowledge base about cancer. CamurWeb is available at http://bioinformatics.iasi.cnr.it/camurweb .

Conclusions: The experiments prove the validity of CamurWeb, obtaining many classification models and thus several genes that are associated to 21 different cancer types. Finally, the comprehensive knowledge base about cancer and the software tool are released online; interested researchers have free access to them for further studies and to design biological experiments in cancer research.

Keywords: Big data; Cancer; Classification; Knowledge extraction.

MeSH terms

  • Base Sequence
  • Gene Expression Regulation, Neoplastic*
  • Genes, Neoplasm
  • Genome, Human
  • Humans
  • Knowledge Bases*
  • Neoplasms / genetics*
  • Sequence Analysis, RNA
  • Software*