Machine Learning for Early Lung Cancer Identification Using Routine Clinical and Laboratory Data

Am J Respir Crit Care Med. 2021 Aug 15;204(4):445-453. doi: 10.1164/rccm.202007-2791OC.

Abstract

Rationale: Most lung cancers are diagnosed at an advanced stage. Presymptomatic identification of high-risk individuals can prompt earlier intervention and improve long-term outcomes.

Objectives: To develop a model to predict a future diagnosis of lung cancer on the basis of routine clinical and laboratory data by using machine learning.

Methods: We assembled data from 6,505 case patients with non-small cell lung cancer (NSCLC) and 189,597 contemporaneous control subjects and compared the accuracy of a novel machine learning model with a modified version of the well-validated 2012 Prostate, Lung, Colorectal and Ovarian Cancer Screening Trial risk model (mPLCOm2012), by using the area under the receiver operating characteristic curve (AUC), sensitivity, and diagnostic odds ratio (OR) as measures of model performance.

Measurements and Main Results: Among ever-smokers in the test set, a machine learning model was more accurate than the mPLCOm2012 for identifying NSCLC 9-12 months before clinical diagnosis (P < 0.00001) and demonstrated an AUC of 0.86, a diagnostic OR of 12.3, and a sensitivity of 40.1% at a predefined specificity of 95%. In comparison, the mPLCOm2012 demonstrated an AUC of 0.79, an OR of 7.4, and a sensitivity of 27.9% at the same specificity. The machine learning model was more accurate than standard eligibility criteria for lung cancer screening and more accurate than the mPLCOm2012 when applied to a screening-eligible population. Influential model variables included known risk factors and novel predictors such as white blood cell and platelet counts.

Conclusions: A machine learning model was more accurate for early diagnosis of NSCLC than either standard eligibility criteria for screening or the mPLCOm2012, demonstrating the potential to help prevent lung cancer deaths through early detection.
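The three performance measures named in the abstract (AUC, sensitivity at a fixed specificity, and diagnostic OR) can be illustrated with a minimal Python sketch. This is not the authors' code. The function names, the toy scores, and the thresholding rule (flag a subject when the score exceeds the control-score cutoff) are assumptions made for illustration, and the abstract does not say how the authors computed these quantities. Plugging the rounded figures from the abstract into the standard diagnostic OR formula gives roughly 7.4 for the mPLCOm2012 and roughly 12.7 for the machine learning model. Small differences from the reported values are to be expected, since the published sensitivities and specificity are rounded.

```python
import math
from typing import Sequence


def diagnostic_odds_ratio(sensitivity: float, specificity: float) -> float:
    """Standard definition: [sens / (1 - sens)] / [(1 - spec) / spec]."""
    if not (0 < sensitivity < 1 and 0 < specificity < 1):
        raise ValueError("sensitivity and specificity must lie strictly between 0 and 1")
    return (sensitivity / (1 - sensitivity)) / ((1 - specificity) / specificity)


def auc(case_scores: Sequence[float], control_scores: Sequence[float]) -> float:
    """Probability that a random case scores higher than a random control.

    Ties count as one half. This pairwise form is O(n*m) and is meant
    for small illustrative inputs only.
    """
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


def sensitivity_at_specificity(
    case_scores: Sequence[float],
    control_scores: Sequence[float],
    specificity: float,
) -> float:
    """Fraction of cases flagged when the cutoff keeps at least
    `specificity` of the controls at or below it (assumed rule)."""
    ranked = sorted(control_scores)
    # Number of controls that must sit at or below the cutoff; the small
    # tolerance guards against floating-point noise in the product.
    needed = max(1, math.ceil(specificity * len(ranked) - 1e-9))
    cutoff = ranked[needed - 1]
    flagged = sum(1 for score in case_scores if score > cutoff)
    return flagged / len(case_scores)


if __name__ == "__main__":
    # Rounded figures taken from the abstract (ever-smokers, test set).
    print(round(diagnostic_odds_ratio(0.279, 0.95), 2))  # mPLCOm2012
    print(round(diagnostic_odds_ratio(0.401, 0.95), 2))  # machine learning model

    # Hypothetical toy risk scores, not data from the study.
    cases = [0.35, 0.5, 0.8, 0.95]
    controls = [0.1, 0.2, 0.3, 0.4, 0.9]
    print(auc(cases, controls))
    print(sensitivity_at_specificity(cases, controls, 0.8))
```

Fixing the specificity first, as the abstract describes, puts both models at the same false-positive rate, so the difference in sensitivity can be compared directly.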

Keywords: early detection of cancer; lung cancer; machine learning; non–small cell lung carcinoma; screening.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Aged
  • Aged, 80 and over
  • Carcinoma, Non-Small-Cell Lung / diagnosis*
  • Case-Control Studies
  • Clinical Decision Rules*
  • Early Detection of Cancer / methods*
  • Female
  • Humans
  • Lung Neoplasms / diagnosis*
  • Machine Learning*
  • Male
  • Middle Aged
  • Odds Ratio
  • ROC Curve
  • Retrospective Studies
  • Sensitivity and Specificity