Persistence of data-driven knowledge to predict breast cancer survival

Int J Med Inform. 2019 Sep:129:303-311. doi: 10.1016/j.ijmedinf.2019.06.018. Epub 2019 Jun 21.

Abstract

Background: Machine learning predictive models for breast cancer survival can improve if they are made specific to the stage of the cancer at the time of diagnosis. However, the relevance of the clinical parameters in that prediction and the predictive quality of these models may change over time.

Objective: To determine whether the findings on the influence of clinical parameters and the performance of machine learning models in the prediction of breast cancer survival should be considered temporary or permanent and, if temporary, how long the newly generated knowledge remains valid.

Methods: Fifteen recently published relevant conclusions on the application of machine learning methods to predict breast cancer survival were identified. Then, the data on breast cancer in the SEER database were used to construct several data-driven models over time to predict five-year survival of breast cancer. Three different machine learning methods were used. Stage-specific models and joint models for all the stages were considered. The predictive quality of the models and the importance of clinical parameters were subjected to a persistence analysis over time in order to determine the validity and durability of these fifteen conclusions.
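The persistence analysis described in Methods can be pictured with a minimal Python sketch. It assumes that each conclusion can be written as a yes/no test on the metrics of models built for successive time windows. The function names, the metric keys, and the AUC values are hypothetical placeholders, not the authors' code or SEER results.

```python
from typing import Callable, Dict, List, Sequence

# One entry per time window: metric name -> value for the models built on that window.
Metrics = Dict[str, float]


def check_over_time(
    windows: Sequence[Metrics],
    conclusion: Callable[[Metrics], bool],
) -> List[bool]:
    """Evaluate one conclusion (a yes/no test) on every time window, in order."""
    return [conclusion(metrics) for metrics in windows]


def summarise(holds: Sequence[bool]) -> Dict[str, object]:
    """Summarise how persistently a conclusion holds across the windows."""
    if not holds:
        raise ValueError("at least one time window is required")
    return {
        "holds_on_latest_window": holds[-1],
        "holds_in_every_window": all(holds),
        "fraction_of_windows": sum(holds) / len(holds),
    }


# Made-up illustrative values (NOT results from the study or from SEER).
toy_windows = [
    {"stage_auc": 0.80, "joint_auc": 0.78},
    {"stage_auc": 0.79, "joint_auc": 0.80},
    {"stage_auc": 0.82, "joint_auc": 0.79},
]

# Hypothetical conclusion: "the stage-specific model outperforms the joint model".
holds = check_over_time(toy_windows, lambda m: m["stage_auc"] > m["joint_auc"])
summary = summarise(holds)
print(holds, summary)
```

In this toy run the conclusion holds on the latest window but not in every window, which is the kind of finding a single snapshot would report as true and a temporal analysis would flag as not persistent.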

Results and conclusions: Only 53% of the conclusions were true for the SEER cases in 1988-2009, and only 75% of these were true over time. Relevant conclusions, such as the impossibility of improving survival prediction for the most frequent stages with more data or the importance of cancer grade for predicting survival of patients with distant metastasis, turned out to be false when subjected to a temporal analysis. Our study concludes that data-driven knowledge obtained with machine learning methods must be validated over time before it can be applied clinically and professionally.
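The reported percentages can be translated into counts with a short calculation. This is a sketch that assumes both figures are rounded shares of the fifteen conclusions named in Methods; the variable names are illustrative and the derived counts are an inference, not numbers stated in the abstract.

```python
# Assumption: 53% and 75% are rounded shares of the fifteen reviewed conclusions.
total_conclusions = 15

# Conclusions found true on the SEER cases: 0.53 * 15 = 7.95, i.e. about 8.
true_on_seer = round(0.53 * total_conclusions)

# Of those, the share that also held over time: 0.75 * 8 = 6.
true_over_time = round(0.75 * true_on_seer)

# Share of all fifteen conclusions that were both true and persistent.
overall_share = true_over_time / total_conclusions

print(true_on_seer, true_over_time, f"{overall_share:.0%}")
```

Under that reading, roughly 8 of the 15 conclusions were confirmed and about 6 of them persisted, or about 40% of the original set.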

Keywords: Breast cancer; Machine learning; Overtime data analysis; SEER dataset; Survival prediction.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Breast Neoplasms / diagnosis*
  • Databases, Factual
  • Female
  • Humans
  • Machine Learning