The Machine Learning Model for Distinguishing Pathological Subtypes of Non-Small Cell Lung Cancer

Hongyue Zhao; Yexin Su; Mengjiao Wang; Zhehao Lyu; Peng Xu; Yuying Jiao; Linhan Zhang; Wei Han; Lin Tian; Peng Fu

doi:10.3389/fonc.2022.875761

Front Oncol. 2022 May 26:12:875761. doi: 10.3389/fonc.2022.875761. eCollection 2022.

Authors

Hongyue Zhao¹, Yexin Su², Mengjiao Wang¹, Zhehao Lyu¹, Peng Xu¹, Yuying Jiao¹, Linhan Zhang¹, Wei Han¹, Lin Tian³, Peng Fu¹

Affiliations

¹ Department of Nuclear Medicine, The First Affiliated Hospital of Harbin Medical University, Harbin, China.
² Department of Magnetic Resonance, The First Affiliated Hospital of Harbin Medical University, Harbin, China.
³ Department of Pathology, The First Affiliated Hospital of Harbin Medical University, Harbin, China.

Abstract

Purpose: Machine learning models were developed and validated to identify lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC) using clinical factors, laboratory metrics, and 2-deoxy-2[¹⁸F]fluoro-D-glucose ([¹⁸F]F-FDG) positron emission tomography (PET)/computed tomography (CT) radiomic features.

Methods: One hundred and twenty non-small cell lung cancer (NSCLC) patients (62 LUAD and 58 LUSC) were analyzed retrospectively and randomized into a training group (n = 85) and a validation group (n = 35). A total of 99 feature parameters (four clinical factors, four laboratory indicators, and 91 [¹⁸F]F-FDG PET/CT radiomic features) were used for data analysis and model construction. The Boruta algorithm was used to screen the features. The retained minimal optimal feature subset was input into ten machine learning algorithms to construct classifiers for distinguishing between LUAD and LUSC. Univariate and multivariate analyses were used to identify the independent risk factors of the NSCLC subtype and to construct the Clinical model. Finally, the area under the receiver operating characteristic curve (AUC), sensitivity, specificity, and accuracy (ACC) were used to validate the best-performing machine learning models and the Clinical model in the validation group, and the DeLong test was used to compare model performance.
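The validation metrics named in the Methods (AUC, sensitivity, specificity, and ACC) can be illustrated with a minimal sketch. This is not the authors' code: the labels and scores are invented toy values, the coding of LUAD as 1 and LUSC as 0 is an assumption, the 0.5 decision threshold is an assumption, and the function names are hypothetical. The AUC is computed with the rank (Mann-Whitney) formulation, which counts how often a positive case scores higher than a negative case.

```python
def confusion_counts(y_true, y_score, threshold=0.5):
    """Count TP, FP, TN, FN, treating label 1 as the positive class."""
    tp = fp = tn = fn = 0
    for label, score in zip(y_true, y_score):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def classification_metrics(y_true, y_score, threshold=0.5):
    """Sensitivity, specificity, and accuracy at a fixed threshold."""
    tp, fp, tn, fn = confusion_counts(y_true, y_score, threshold)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(y_true),
    }


def auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy data (assumption): 1 = LUAD, 0 = LUSC; scores are made-up probabilities.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]

metrics = classification_metrics(y_true, y_score)
toy_auc = auc(y_true, y_score)
print(metrics)   # sensitivity, specificity, and accuracy are all 0.75
print(toy_auc)   # 0.875
```

Sensitivity, specificity, and ACC depend on the chosen threshold, whereas the AUC summarizes ranking quality across all thresholds, which is one reason the two kinds of metric are usually reported together.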

Results: The Boruta algorithm selected an optimal subset of 13 features, including two clinical features, two laboratory indicators, and nine PET/CT radiomic features. The Random Forest (RF) and Support Vector Machine (SVM) models showed the best performance in the training group. Gender (P = 0.018) and smoking status (P = 0.011) were used to construct the Clinical model. In the validation group, the SVM model (AUC: 0.876, ACC: 0.800) and RF model (AUC: 0.863, ACC: 0.800) performed well, while the Clinical model (AUC: 0.712, ACC: 0.686) performed moderately. There was no significant difference between the RF and Clinical models, but the SVM model was significantly better than the Clinical model.
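The DeLong comparison of two correlated AUCs, which underlies the model comparison reported here, can be sketched as follows. This is an illustrative stdlib-only implementation under stated assumptions, not the authors' analysis: the scores are invented toy values, both models are assumed to be scored on the same patients (a paired design), label 1 is assumed to be the positive class, and all function names are hypothetical. The variance of the AUC difference is computed from the per-case "structural components" of each model.

```python
import math
import statistics


def _psi(x, y):
    """Pairwise kernel: 1 if the positive outscores the negative, 0.5 on ties."""
    return 1.0 if x > y else 0.5 if x == y else 0.0


def _components(pos, neg):
    """Return AUC plus per-positive (V10) and per-negative (V01) components."""
    v10 = [sum(_psi(x, y) for y in neg) / len(neg) for x in pos]
    v01 = [sum(_psi(x, y) for x in pos) / len(pos) for y in neg]
    return sum(v10) / len(v10), v10, v01


def delong_test(y_true, scores_a, scores_b):
    """Two-sided DeLong test for two models scored on the same cases.

    Returns (auc_a, auc_b, z, p_value).
    """
    pos_idx = [i for i, y in enumerate(y_true) if y == 1]
    neg_idx = [i for i, y in enumerate(y_true) if y == 0]
    m, n = len(pos_idx), len(neg_idx)

    auc_a, v10_a, v01_a = _components(
        [scores_a[i] for i in pos_idx], [scores_a[i] for i in neg_idx]
    )
    auc_b, v10_b, v01_b = _components(
        [scores_b[i] for i in pos_idx], [scores_b[i] for i in neg_idx]
    )

    # The sample variance of the paired component differences equals
    # S_aa + S_bb - 2 * S_ab in DeLong's covariance notation.
    d10 = [a - b for a, b in zip(v10_a, v10_b)]
    d01 = [a - b for a, b in zip(v01_a, v01_b)]
    var = statistics.variance(d10) / m + statistics.variance(d01) / n
    if var == 0:
        raise ValueError("zero variance: the test statistic is undefined")

    z = (auc_a - auc_b) / math.sqrt(var)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return auc_a, auc_b, z, p_value


# Toy data (assumption): 1 = LUAD, 0 = LUSC; two hypothetical models.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores_a = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]
scores_b = [0.6, 0.4, 0.7, 0.3, 0.5, 0.8, 0.2, 0.35]

auc_a, auc_b, z, p_value = delong_test(y_true, scores_a, scores_b)
print(auc_a, auc_b)  # 0.875 0.5625
print(z, p_value)
```

Because the test accounts for the correlation between the two models' scores on the same cases, two AUC gaps of similar size can lead to different conclusions, as with the RF and SVM comparisons against the Clinical model.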

Conclusions: The proposed SVM and RF models successfully distinguished LUAD from LUSC. The results indicate that the proposed models are accurate, noninvasive predictive tools that can assist clinical decision-making, especially for patients who cannot undergo biopsy or in whom biopsy fails.

Keywords: [18F]F-FDG PET/CT; lung adenocarcinoma; lung squamous cell carcinoma; machine learning; radiomics.