Explainable machine learning approach for cancer prediction through binarilization of RNA sequencing data

PLoS One. 2024 May 10;19(5):e0302947. doi: 10.1371/journal.pone.0302947. eCollection 2024.

Abstract

In recent years, researchers have demonstrated the effectiveness and speed of machine learning-based cancer diagnosis models. However, the results generated by machine learning models are difficult to explain, especially for models that use complex, high-dimensional data such as RNA sequencing data. In this study, we propose the binarilization technique as a novel way to treat RNA sequencing data and use it to construct explainable cancer prediction models. We tested the proposed data processing technique on five different models, namely neural network, random forest, xgboost, support vector machine, and decision tree, using four cancer datasets collected from the National Cancer Institute Genomic Data Commons. Because our datasets are imbalanced, we evaluated the performance of all models using metrics suited to imbalanced data: geometric mean, Matthews correlation coefficient, F-measure, and area under the receiver operating characteristic curve. Our approach showed comparable performance while relying on fewer features. Additionally, we demonstrated that data binarilization offers higher explainability by revealing how each feature affects the prediction. These results demonstrate the potential of the data binarilization technique for improving the performance and explainability of RNA sequencing-based cancer prediction models.
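As a rough illustration of the two ideas the abstract names, the sketch below binarizes a small expression matrix and computes three of the listed imbalance-aware metrics. The abstract does not state how the binarization threshold is chosen, so the per-gene median cutoff used here is purely an assumption, and the function names `binarize` and `imbalance_metrics` are hypothetical. The area under the ROC curve is left out because it requires predicted scores rather than hard labels.

```python
import math
from statistics import median


def binarize(matrix):
    """Map each value to 1 if it exceeds its gene's (column's) median, else 0.

    ASSUMPTION: a per-gene median threshold; the paper's actual rule may differ.
    """
    n_genes = len(matrix[0])
    thresholds = [median(row[j] for row in matrix) for j in range(n_genes)]
    return [[1 if row[j] > thresholds[j] else 0 for j in range(n_genes)]
            for row in matrix]


def imbalance_metrics(y_true, y_pred):
    """Geometric mean, F-measure (F1), and Matthews correlation coefficient."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0

    g_mean = math.sqrt(sensitivity * specificity)
    f1 = (2 * precision * sensitivity / (precision + sensitivity)
          if precision + sensitivity else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"g_mean": g_mean, "f1": f1, "mcc": mcc}


# Toy data: 4 samples (rows) x 3 genes (columns); values are made up.
expression = [
    [0.0, 5.2, 1.1],
    [3.4, 0.1, 1.0],
    [7.9, 4.8, 0.9],
    [0.2, 0.3, 1.2],
]
binary = binarize(expression)

# Toy labels with a 2:4 class imbalance; predictions are made up.
scores = imbalance_metrics([1, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 1])
```

With binary inputs, each feature reduces to a present/absent style indicator, which is one way to read the abstract's claim that binarilization reveals how each feature affects the prediction.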

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Decision Trees
  • Humans
  • Machine Learning*
  • Neoplasms* / genetics
  • Neural Networks, Computer
  • ROC Curve
  • Sequence Analysis, RNA* / methods
  • Support Vector Machine

Grants and funding

The author(s) received no specific funding for this work.