Predicting second breast cancer among women with primary breast cancer using machine learning algorithms, a population-based observational study

Int J Cancer. 2023 Sep 1;153(5):932-941. doi: 10.1002/ijc.34568. Epub 2023 May 27.

Abstract

Breast cancer survivors often experience recurrence or a second primary cancer. We developed an automated approach to predict the occurrence of any second breast cancer (SBC) using patient-level data and explored the generalizability of the models with an external validation data source. Breast cancer patients from the cancer registry of Zurich, Zug, Schaffhausen, Schwyz (N = 3213; training dataset) and the cancer registry of Ticino (N = 1073; external validation dataset), diagnosed between 2010 and 2018, were used for model training and validation, respectively. Machine learning (ML) methods, namely a feed-forward neural network (ANN), logistic regression, and extreme gradient boosting (XGB), were employed for classification. The best-performing model was selected based on the receiver operating characteristic (ROC) curve. Key characteristics contributing to a high SBC risk were identified. SBC was diagnosed in 6% of all cases. The most important features for SBC prediction were age at incidence, year of birth, stage, and extent of the pathological primary tumor. The ANN model had the highest area under the ROC curve with 0.78 (95% confidence interval [CI] 0.75-0.82) in the training data and 0.70 (95% CI 0.61-0.79) in the external validation data. Investigating the generalizability of different ML algorithms, we found that the ANN generalized better than the other models on the external validation data. This research is a first step towards the development of an automated tool that could assist clinicians in the identification of women at high risk of developing an SBC and potentially preventing it.
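For readers unfamiliar with the reported metric, the following is a minimal sketch of how an area under the ROC curve and a 95% confidence interval can be computed from predicted risk scores. The abstract does not state how the authors derived their intervals; the percentile bootstrap used here is an assumption, and the function names `roc_auc` and `bootstrap_auc_ci` are hypothetical helpers written for illustration, not part of the study's code.

```python
import random


def roc_auc(labels, scores):
    """AUC as the probability that a randomly chosen positive case
    receives a higher score than a randomly chosen negative case
    (ties count as 0.5). Quadratic in sample size; fine for a sketch."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the AUC (an assumed method,
    not necessarily the one used in the study)."""
    roc_auc(labels, scores)  # validates that both classes are present
    rng = random.Random(seed)
    n = len(labels)
    aucs = []
    while len(aucs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        # Skip resamples that contain only one class; AUC is undefined there.
        if 0 < sum(ys) < n:
            aucs.append(roc_auc(ys, [scores[i] for i in idx]))
    aucs.sort()
    lower = aucs[int((alpha / 2) * n_boot)]
    upper = aucs[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


if __name__ == "__main__":
    # Toy labels and scores, invented purely for illustration.
    y = [0, 0, 1, 1, 0, 1, 0, 0]
    s = [0.10, 0.40, 0.35, 0.80, 0.20, 0.70, 0.30, 0.05]
    print(roc_auc(y, s), bootstrap_auc_ci(y, s, n_boot=200))
```

With an outcome as rare as the 6% SBC prevalence reported above, resamples can contain very few positive cases, which is one reason intervals on a smaller external dataset tend to be wider than those on the training data.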

Keywords: breast cancer; cancer registry; machine learning; prediction; second cancer.

Publication types

  • Observational Study

MeSH terms

  • Algorithms
  • Breast
  • Breast Neoplasms* / diagnosis
  • Breast Neoplasms* / epidemiology
  • Female
  • Humans
  • Machine Learning
  • Neural Networks, Computer