Performance analysis of data resampling on class imbalance and classification techniques on multi-omics data for cancer classification

Yuting Yang; Golrokh Mirzaei

doi:10.1371/journal.pone.0293607

PLoS One. 2024 Feb 29;19(2):e0293607. doi: 10.1371/journal.pone.0293607. eCollection 2024.

Authors

Yuting Yang¹, Golrokh Mirzaei²

Affiliations

¹ Department of Computer Science and Engineering, The Ohio State University, Columbus, Ohio, United States of America.
² Department of Computer Science and Engineering, The Ohio State University, Marion, Ohio, United States of America.

Abstract

Cancer, in any of its forms, remains a significant public health concern worldwide. Advances in early detection and treatment may have contributed to the decline in the overall cancer death rate in recent decades. Therefore, tumor prediction and classification play an important role in fighting cancer. This study built computational models for a joint analysis of RNA-seq, copy number variation (CNV), and DNA methylation data to classify normal and tumor samples across liver cancer, breast cancer, and colon adenocarcinoma datasets from The Cancer Genome Atlas (TCGA). A total of 18 machine learning methods were evaluated on AUC, precision, recall, and F-measure. In addition, five techniques were compared for mitigating class imbalance in the cancer datasets, of which the Synthetic Minority Oversampling Technique (SMOTE) performed best. The results indicate that the model using Stochastic Gradient Descent (SGD) to train a binary-class SVM with hinge loss achieved the best classification results on the liver cancer and breast cancer datasets, with accuracy over 99% and AUC greater than or equal to 0.999. For the colon adenocarcinoma dataset, both SGD and Sequential Minimal Optimization (SMO), which implements John Platt's sequential minimal optimization algorithm for training a support vector machine, showed outstanding classification performance, with accuracy of 100% and AUC, precision, recall, and F-measure all at 1.000.
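The two central techniques named in the abstract, SMOTE oversampling and a linear SVM trained by SGD on the hinge loss, can be illustrated with a minimal sketch. This is not the authors' pipeline: the two-dimensional toy data, the function names (`smote`, `train_sgd_hinge`, `precision_recall_f1`), the choice of "normal" as the minority class, and every hyperparameter value are assumptions made purely for illustration, and the code uses only the Python standard library.

```python
import random

def smote(minority, n_new, k=3, seed=0):
    """Simplified SMOTE: each synthetic point is interpolated between a
    minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        others = [m for m in minority if m is not x]
        # Sort the remaining minority samples by squared distance to x.
        others.sort(key=lambda m: sum((a - c) ** 2 for a, c in zip(x, m)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from x to neighbour
        synthetic.append([a + gap * (c - a) for a, c in zip(x, neighbour)])
    return synthetic

def train_sgd_hinge(X, y, epochs=50, lr=0.1, lam=1e-3, seed=0):
    """Linear SVM fitted by SGD on the L2-regularised hinge loss.
    Labels must be -1 or +1."""
    rng = random.Random(seed)
    w, b = [0.0] * len(X[0]), 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            margin = y[i] * (sum(wj * xj for wj, xj in zip(w, X[i])) + b)
            w = [wj * (1 - lr * lam) for wj in w]  # L2 shrinkage step
            if margin < 1:  # hinge loss is active: move towards the sample
                w = [wj + lr * y[i] * xj for wj, xj in zip(w, X[i])]
                b += lr * y[i]
    return w, b

def predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

def precision_recall_f1(y_true, y_pred, positive):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy imbalanced data (assumed, NOT TCGA): 40 "tumor" (+1) vs 6 "normal" (-1).
data_rng = random.Random(42)
tumor = [[data_rng.gauss(2, 0.5), data_rng.gauss(2, 0.5)] for _ in range(40)]
normal = [[data_rng.gauss(-2, 0.5), data_rng.gauss(-2, 0.5)] for _ in range(6)]

# Oversample the minority class until both classes have the same size.
synthetic = smote(normal, n_new=len(tumor) - len(normal))
X = tumor + normal + synthetic
y = [1] * len(tumor) + [-1] * (len(normal) + len(synthetic))

w, b = train_sgd_hinge(X, y)
y_pred = [predict(w, b, x) for x in X]
precision, recall, f1 = precision_recall_f1(y, y_pred, positive=-1)
print(f"precision={precision:.3f} recall={recall:.3f} F-measure={f1:.3f}")
```

For brevity the sketch scores the model on the same samples it was trained on. The abstract does not describe the evaluation protocol; in practice, oversampling is normally applied to the training split only, and metrics are computed on held-out data.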

Copyright: © 2024 Yang, Mirzaei. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

MeSH terms

Adenocarcinoma*
Colonic Neoplasms* / genetics
DNA Copy Number Variations
Humans
Liver Neoplasms*
Multiomics

Grants and funding

The author(s) received no specific funding for this work.