Efficient learning from big data for cancer risk modeling: A case study with melanoma

Comput Biol Med. 2019 Jul:110:29-39. doi: 10.1016/j.compbiomed.2019.04.039. Epub 2019 Apr 30.

Abstract

Background: Building cancer risk models from real-world data requires overcoming challenges in data preprocessing, efficient representation, and computational performance. We present a case study of a cloud-based approach to learning from de-identified electronic health record data and demonstrate its effectiveness for melanoma risk prediction.

Methods: We used a hybrid distributed and non-distributed approach to computing in the cloud: distributed processing with Apache Spark for data preprocessing and labeling, and non-distributed processing with scikit-learn for machine learning model training. We also explored the effects of sampling the training dataset to improve computational performance. Risk factors were evaluated using regression weights as well as tree SHAP values.
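
The sampling step can be illustrated with a minimal sketch. The abstract does not state which sampling scheme was used, so the stratified random subsample below is an assumption, and the function name `subsample_training_set`, the toy data, and the 1% fraction are hypothetical. It uses only the Python standard library; in a workflow like the one described, the reduced dataset would then be passed to a single-machine learner such as a scikit-learn estimator.

```python
import random


def subsample_training_set(rows, labels, fraction, seed=0):
    """Return a stratified random subsample keeping about `fraction` of each class.

    Illustrative only: the sampling scheme is an assumption, not the
    method reported in the abstract.
    """
    rng = random.Random(seed)

    # Group row indices by class label so each class is sampled separately.
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    kept = []
    for indices in by_class.values():
        # Keep at least one example per class so a rare class is never lost.
        n_keep = max(1, round(len(indices) * fraction))
        kept.extend(rng.sample(indices, n_keep))

    kept.sort()  # preserve the original row order
    return [rows[i] for i in kept], [labels[i] for i in kept]


# Synthetic toy data (not from the study): 10,000 rows, 1% positive.
labels = [1 if i % 100 == 0 else 0 for i in range(10_000)]
rows = [[i, i % 7] for i in range(10_000)]

# Reduce the training set by two orders of magnitude.
small_rows, small_labels = subsample_training_set(rows, labels, fraction=0.01)
```

Stratifying by label keeps the class ratio of the sample close to that of the full data, which matters when the positive class is as rare as it is in a cancer screening cohort.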

Results: Among 4,061,172 patients who did not have melanoma through the 2016 calendar year, 10,129 were diagnosed with melanoma within one year. A gradient-boosted classifier achieved the best predictive performance with cross-validation (AUC = 0.799, Sensitivity = 0.753, Specificity = 0.688). Compared to a model built on the original data, a dataset two orders of magnitude smaller could achieve statistically similar or better performance with less than 1% of the training time and cost.
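
For readers less familiar with the reported metrics, the sketch below shows how they are conventionally defined. Only the cohort size and case count come from the abstract; the confusion-matrix counts and the score lists are hypothetical values chosen for illustration, and the function names are not from the paper.

```python
def sensitivity(tp, fn):
    """True positive rate: fraction of actual cases that are flagged."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """True negative rate: fraction of non-cases that are not flagged."""
    return tn / (tn + fp)


def auc(pos_scores, neg_scores):
    """AUC as the probability that a random case outscores a random non-case.

    Ties count as one half. This pairwise form is quadratic in the number
    of samples, so it is only suitable for small illustrations.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# One-year incidence implied by the counts reported in the abstract.
cohort_size = 4_061_172
cases = 10_129
incidence = cases / cohort_size  # roughly 0.25% of the cohort

# Hypothetical confusion-matrix counts (not study data).
tp, fn, tn, fp = 75, 25, 690, 310
example_sensitivity = sensitivity(tp, fn)
example_specificity = specificity(tn, fp)

# Hypothetical model scores (not study data).
example_auc = auc([0.9, 0.6], [0.7, 0.2, 0.1])
```

Because only about one patient in four hundred is diagnosed within the year, accuracy alone would be uninformative here, which is one reason threshold-independent and class-conditional metrics such as AUC, sensitivity, and specificity are the ones reported.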

Conclusions: We produced a model that can effectively predict melanoma risk for a diverse dermatology population in the U.S. by using hybrid computing infrastructure and data sampling. For this de-identified clinical dataset, sampling approaches significantly shortened the time for model building while retaining predictive accuracy, allowing for more rapid machine learning model experimentation on familiar computing machinery. A large number of risk factors (>300) were required to produce the best model.

Keywords: Big data; Cloud computing; Early detection of cancer; Electronic health records; Machine learning.

MeSH terms

  • Big Data*
  • Electronic Health Records*
  • Humans
  • Machine Learning*
  • Melanoma* / epidemiology
  • Melanoma* / metabolism
  • Melanoma* / pathology
  • Models, Biological*
  • Predictive Value of Tests
  • Risk Assessment
  • Risk Factors