Machine learning (ML) techniques to predict breast cancer in imbalanced datasets: a systematic review

J Cancer Surviv. 2023 Sep 26. doi: 10.1007/s11764-023-01465-3. Online ahead of print.

Abstract

Knowledge discovery in databases (KDD) is crucial for analyzing data to extract valuable insights. It is increasingly applied to medical outcome prediction, particularly for diseases with high incidence, mortality, and costs, such as cancer. Machine learning (ML) techniques can yield more accurate predictive models of cancer patients' clinical outcomes and so support informed healthcare decision-making. However, cancer prediction modeling is challenging because the datasets are imbalanced: patients with a cancer diagnosis form a small minority class, while cancer-free patients form the majority class. Imbalanced datasets pose statistical hurdles, such as bias and overfitting, when developing accurate prediction models. This systematic review focuses on breast cancer prediction articles published from 2008 to 2023. The objective is to examine the ML methods used in three critical steps of KDD (preprocessing, data mining, and interpretation) that address the imbalanced data problem in breast cancer prediction. The work synthesizes prior research on ML methods for breast cancer prediction. The findings help identify effective preprocessing strategies, including balancing and feature selection methods, as well as robust predictive models and the metrics used to evaluate them. The study aims to inform healthcare providers and researchers about effective techniques for accurate breast cancer prediction.
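
To make the imbalance problem concrete, the following minimal Python sketch shows why plain accuracy can mislead when the positive class is rare, and how random oversampling, one simple balancing method, equalizes class counts. The labels are synthetic, and the 5% positive rate, the function names, and the choice of random oversampling are illustrative assumptions; none of them are taken from the reviewed studies.

```python
import random
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of labels predicted correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of true positives that the model actually finds."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp / (tp + fn) if (tp + fn) else 0.0


def random_oversample(labels, seed=0):
    """Return row indices in which every class is padded, by sampling its
    own indices with replacement, up to the size of the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    target = max(len(ix) for ix in by_class.values())
    out = []
    for ix in by_class.values():
        out.extend(ix)                                   # keep every original row
        out.extend(rng.choices(ix, k=target - len(ix)))  # pad the smaller classes
    return out


# Synthetic labels (assumed for illustration): 50 positive, 950 negative cases.
y_true = [1] * 50 + [0] * 950

# A trivial "model" that always predicts the majority (negative) class.
y_pred = [0] * len(y_true)

acc = accuracy(y_true, y_pred)  # looks high despite the model being useless
rec = recall(y_true, y_pred)    # reveals that no positive case is detected

# Balance the class counts by oversampling the minority class.
idx = random_oversample(y_true)
balanced = Counter(y_true[i] for i in idx)

print(f"accuracy={acc:.2f} recall={rec:.2f} balanced_counts={dict(balanced)}")
```

In practice, resampling of this kind would normally be applied only to the training split, so that evaluation metrics still reflect the original class distribution.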

Keywords: Cancer; Classification; Feature selection; Imbalanced data; Resampling.

Publication types

  • Review