Applying density-based outlier identifications using multiple datasets for validation of stroke clinical outcomes

Int J Med Inform. 2019 Dec:132:103988. doi: 10.1016/j.ijmedinf.2019.103988. Epub 2019 Oct 3.

Abstract

Introduction: Clinicians commonly use the modified Rankin Scale (mRS) and the Barthel Index (BI) to measure clinical outcome after stroke. These are potential targets in machine learning models for stroke outcome prediction. Therefore, the quality of the measurements is crucial for training and validation of these models. The objective of this study was to apply and evaluate density-based outlier detection methods for identifying potentially incorrect measurements in multiple large stroke datasets to assess the measurement quality.

Method: We developed three density-based outlier detection methods, namely density-based spatial clustering of applications with noise (DBSCAN), hierarchical DBSCAN (HDBSCAN), and local outlier factor (LOF), using a large dataset obtained from a nationwide prospective stroke registry in Taiwan. Each method was then tested on four different NINDS-funded stroke datasets.
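
As an illustration only, the following minimal sketch shows how density-based methods can flag implausible mRS/BI pairs. It is not the authors' implementation: the synthetic records, the choice of paired mRS and BI scores as features, the BI rescaling, and the parameters (eps, min_samples, n_neighbors) are all assumptions made for the example. It uses scikit-learn's DBSCAN and LocalOutlierFactor; HDBSCAN is omitted for brevity.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)

# Synthetic, made-up records (not registry data): BI falls as mRS rises.
mrs_clean = np.repeat(np.arange(6), 200)
bi_clean = 90 - 16 * mrs_clean + rng.uniform(-8, 8, size=mrs_clean.size)

# Hypothetical implausible pairs, e.g. mRS 0 (no symptoms) recorded with BI 10.
injected = np.array([[0, 10], [0, 20], [5, 100], [5, 90], [1, 0]], dtype=float)

records = np.vstack([np.column_stack([mrs_clean, bi_clean]), injected])
injected_idx = np.arange(len(records) - len(injected), len(records))

# Assumed preprocessing: put both scores on a comparable 0-5 range
# so that neither one dominates the Euclidean distance.
features = records / np.array([1.0, 20.0])

# DBSCAN labels points that belong to no dense region as noise (-1).
dbscan_labels = DBSCAN(eps=0.6, min_samples=5).fit_predict(features)
dbscan_outliers = np.flatnonzero(dbscan_labels == -1)

# LOF marks points whose local density is much lower than that of
# their nearest neighbours as outliers (-1).
lof_labels = LocalOutlierFactor(n_neighbors=20).fit_predict(features)
lof_outliers = np.flatnonzero(lof_labels == -1)

print("DBSCAN flagged (mRS, BI):", records[dbscan_outliers].tolist())
print("LOF flagged count:", lof_outliers.size)
```

In this toy setting the injected pairs sit far from every dense mRS/BI group, so both methods mark them as outliers. On real data the parameters would need tuning against records with known measurement errors.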

Result: DBSCAN achieved high performance across all mRS values: the highest average accuracy was 99.2 ± 0.7 at an mRS of 4 and the lowest was 92.0 ± 4.6 at an mRS of 3. LOF achieved similar performance; HDBSCAN with default parameter settings, however, required further tuning.

Conclusion: The density-based outlier detection methods proved promising for validation of stroke outcome measures. The outlier detection algorithm developed from a large prospective registry dataset was effectively applied to four different NINDS stroke datasets with high performance. The tool developed from this detection algorithm can be further applied to real-world datasets to improve data quality in stroke outcome measures.

Keywords: Barthel Index; Outlier detection; Stroke outcome; modified Rankin Scale.

Publication types

  • Research Support, N.I.H., Intramural

MeSH terms

  • Aged
  • Algorithms*
  • Cluster Analysis
  • Datasets as Topic
  • Female
  • Humans
  • Machine Learning*
  • Male
  • Outcome Assessment, Health Care / methods*
  • Outcome Assessment, Health Care / standards
  • Outcome Assessment, Health Care / statistics & numerical data*
  • Prospective Studies
  • Research Design
  • Stroke / epidemiology
  • Stroke / pathology*
  • Stroke / therapy
  • Taiwan / epidemiology
  • Treatment Outcome
  • Validation Studies as Topic