Performance analysis of data resampling on class imbalance and classification techniques on multi-omics data for cancer classification

PLoS One. 2024 Feb 29;19(2):e0293607. doi: 10.1371/journal.pone.0293607. eCollection 2024.

Abstract

Cancer, in any of its forms, remains a significant public health concern worldwide. Advances in early detection and treatment could lead to a decline in the overall death rate from cancer in recent decades. Therefore, tumor prediction and classification play an important role in fighting cancer. This study built computational models for a joint analysis of RNA seq, copy number variation (CNV), and DNA methylation to classify normal and tumor samples across liver cancer, breast cancer, and colon adenocarcinoma from The Cancer Genome Atlas (TCGA) dataset. Total of 18 machine learning methods were evaluated based on the AUC, precision, recall, and F-measure. Besides, five techniques were compared to ameliorate problems of class imbalance in the cancer datasets. Synthetic Minority Oversampling Technique (SMOTE) demonstrated the best performance. The results indicate that the model applying Stochastic Gradient Descent (SGD) for learning binary class SVM with hinge loss has the highest classification results on liver cancer and breast cancer datasets, with accuracy over 99% and AUC greater than or equal to 0.999. For colon adenocarcinoma dataset, both SGD and Sequential Minimal Optimization (SMO) that implements John Platt's sequential minimal optimization algorithm for training a support vector machine shows an outstanding classification performance with accuracy of 100%, AUC, precision, recall, and F-measure all at 1.000.

MeSH terms

  • Adenocarcinoma*
  • Colonic Neoplasms* / genetics
  • DNA Copy Number Variations
  • Humans
  • Liver Neoplasms*
  • Multiomics

Grants and funding

The author(s) received no specific funding for this work.