Feature selection and machine learning methods for optimal identification and prediction of subtypes in Parkinson's disease

Comput Methods Programs Biomed. 2021 Jul:206:106131. doi: 10.1016/j.cmpb.2021.106131. Epub 2021 Apr 29.

Abstract

Objectives: The present work focuses on the assessment of Parkinson's disease (PD), including both PD subtype identification (an unsupervised task) and prediction (a supervised task). We specifically investigate optimal feature selection and machine learning algorithms for these tasks.

Methods: We selected 885 PD subjects from longitudinal datasets (years 0-4; Parkinson's Progression Markers Initiative) and investigated 981 features, including motor, non-motor, and imaging features (SPECT-based radiomics features extracted using our standardized SERA software). Two different hybrid machine learning systems (HMLSs) were constructed and applied to the data to select optimal combinations for both tasks: (i) identification of subtypes in PD (unsupervised clustering), and (ii) prediction of these subtypes in year 4 (supervised classification). From the original data for years 0 (baseline) and 1, we created new datasets as inputs to the prediction task: (i, ii) CSD0 and CSD01, cross-sectional datasets from year 0 only and from both years 0 & 1, respectively; and (iii) TD01, a timeless dataset from both years 0 & 1. PD subtype in year 4 was used as the outcome. Finally, high-score features were derived via ensemble voting based on their prioritization by the feature selector algorithms (FSAs).
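The ensemble voting step described above can be illustrated with a minimal sketch. The abstract does not specify the voting rule, so a Borda-style count is assumed here purely for illustration; the selector names (`fsa_a`, `fsa_b`, `fsa_c`), the feature names, and the function `ensemble_vote` are all hypothetical and are not taken from the study.

```python
from collections import defaultdict


def ensemble_vote(rankings, top_k):
    """Aggregate per-selector feature rankings with a Borda-style count.

    rankings: dict mapping a selector name to a list of feature names,
              ordered from most to least important.
    top_k:    number of high-score features to keep.
    """
    scores = defaultdict(int)
    for ordered_features in rankings.values():
        n = len(ordered_features)
        for position, feature in enumerate(ordered_features):
            # The best-ranked feature earns n points, the last earns 1.
            scores[feature] += n - position
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda f: (-scores[f], f))
    return ranked[:top_k], dict(scores)


# Hypothetical rankings from three hypothetical feature selectors.
rankings = {
    "fsa_a": ["updrs_iii", "putamen_radiomic_1", "moca", "age"],
    "fsa_b": ["putamen_radiomic_1", "updrs_iii", "age", "moca"],
    "fsa_c": ["putamen_radiomic_1", "moca", "updrs_iii", "age"],
}

selected, scores = ensemble_vote(rankings, top_k=2)
print(selected)
```

Other aggregation rules (for example, counting how often a feature appears in each selector's top-k list) would fit the same interface; the Borda count is used only because it needs no tuning threshold.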

Results: In the clustering task, the optimal combinations (out of 981) were selected by individual FSAs, enabling high correlation compared to using all features (arriving at 547). In the prediction task, we were able to select optimal combinations that reached an accuracy >90% only for the timeless dataset (TD01); there, the optimal combination used 77 features selected directly by FSAs. In both tasks, however, using only the high-score features from ensemble voting did not yield acceptable performance, showing that optimal feature selection via individual FSAs was more effective.
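The difference between the cross-sectional and timeless inputs compared above can be sketched as follows. The abstract does not define the dataset layouts, so this sketch assumes that CSD01 joins year-0 and year-1 features side by side for each subject, while TD01 stacks the two visits as separate rows sharing the same columns. The record fields (`subject`, `updrs_iii`, `moca`) and the builder functions are hypothetical.

```python
def build_csd0(year0):
    """Cross-sectional, year 0 only: one row per subject."""
    return [dict(row) for row in year0]


def build_csd01(year0, year1):
    """Cross-sectional, years 0 & 1 (assumed layout): one row per subject,
    with the features of both years placed side by side."""
    year1_by_subject = {row["subject"]: row for row in year1}
    merged = []
    for row0 in year0:
        row1 = year1_by_subject.get(row0["subject"])
        if row1 is None:
            continue  # skip subjects without a year-1 visit
        combined = {"subject": row0["subject"]}
        for key, value in row0.items():
            if key != "subject":
                combined[key + "_y0"] = value
        for key, value in row1.items():
            if key != "subject":
                combined[key + "_y1"] = value
        merged.append(combined)
    return merged


def build_td01(year0, year1):
    """Timeless, years 0 & 1 (assumed layout): each visit becomes its own
    row, so the sample count grows while the feature count stays fixed."""
    return [dict(row) for row in year0] + [dict(row) for row in year1]


# Hypothetical toy records for two subjects.
year0 = [
    {"subject": "s1", "updrs_iii": 20, "moca": 27},
    {"subject": "s2", "updrs_iii": 31, "moca": 25},
]
year1 = [
    {"subject": "s1", "updrs_iii": 24, "moca": 26},
    {"subject": "s2", "updrs_iii": 35, "moca": 24},
]

csd0 = build_csd0(year0)
csd01 = build_csd01(year0, year1)
td01 = build_td01(year0, year1)
print(len(csd0), len(csd01), len(td01))
```

Under this assumed layout, TD01 doubles the number of training samples relative to CSD0 without widening the feature space, whereas CSD01 doubles the feature count at a fixed sample size.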

Conclusion: Combining non-imaging information with SPECT-based radiomics features, together with optimal utilization of HMLSs, can enable robust identification of subtypes as well as appropriate prediction of these subtypes in PD patients. Moreover, use of the timeless dataset, beyond the cross-sectional datasets, enabled predictive accuracies over 90%. Overall, we showed that radiomics features extracted from SPECT images are important in both clustering and prediction of PD subtypes.

Keywords: Ensemble Voting; Feature Selection Algorithms; Outcome prediction; Parkinson's disease; Radiomics; Subtype identification.

MeSH terms

  • Algorithms
  • Cross-Sectional Studies
  • Humans
  • Machine Learning
  • Parkinson Disease* / diagnostic imaging