Classification of motor vehicle crash injury severity: A hybrid approach for imbalanced data

Accid Anal Prev. 2018 Nov:120:250-261. doi: 10.1016/j.aap.2018.08.025. Epub 2018 Aug 30.

Abstract

This study aims to classify injury severity in motor-vehicle crashes with both high accuracy and high sensitivity. The dataset contains 297,113 vehicle crashes from 2016-2017, obtained from the Michigan Traffic Crash Facts (MTCF) dataset. As in other crash datasets, the accident severity classes are not equally represented in MTCF. To account for the class imbalance, we apply several techniques, including under-sampling and over-sampling. Using five classification models (logistic regression, decision tree, neural network, gradient boosting model, and Naïve Bayes classifier), we classify the levels of injury severity and attempt to improve classification performance with two training-testing methods: bootstrap aggregation (bagging) and majority voting. Because the dataset is imbalanced, we evaluate classification performance with the geometric mean (G-mean). We show that classification performance is highest when bagging is used with decision trees and over-sampling is applied to the imbalanced data. The effect of the imbalance treatments is largest when under-sampling is combined with bagging. In addition to the original five classes of injury severity in the MTCF dataset, we consider two further classification problems, one with two classes and one with three classes, to (1) investigate the impact of the number of classes on the performance of the classification models, and (2) enable comparison of our results with the literature.
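
To illustrate why the G-mean is preferred over plain accuracy for imbalanced classes, and what a simple over-sampling treatment looks like, here is a minimal Python sketch. It is not the authors' implementation. It assumes the G-mean is the geometric mean of per-class recall (sensitivity), which is a common multi-class definition but is not spelled out in the abstract. It also assumes plain random over-sampling with replacement. The function names `g_mean` and `random_oversample` and the toy labels are hypothetical.

```python
import math
import random
from collections import Counter, defaultdict


def g_mean(y_true, y_pred):
    """Geometric mean of per-class recall (assumed definition of G-mean)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    total = Counter(y_true)  # number of true samples per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    recalls = [correct[c] / total[c] for c in total]
    return math.prod(recalls) ** (1.0 / len(recalls))


def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows at random until all classes are equal in size."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # Sample with replacement to fill the gap up to the largest class.
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Toy labels (hypothetical): 8 "no injury" crashes and 2 "fatal" crashes.
y_true = ["no injury"] * 8 + ["fatal"] * 2
always_majority = ["no injury"] * 10  # a classifier that ignores the rare class

accuracy = sum(t == p for t, p in zip(y_true, always_majority)) / len(y_true)
score = g_mean(y_true, always_majority)

balanced_rows, balanced_labels = random_oversample(list(range(10)), y_true)
```

In this toy case the majority-only classifier reaches an accuracy of 0.8 but a G-mean of 0, because its recall on the rare class is zero. After over-sampling, both classes contain 8 rows. In a setup like the one the abstract describes, the resampling would be applied to training data only, before fitting each model.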

Keywords: Automated vehicle safety; Data analytics; Imbalanced data; Injury severity classification; Machine learning; Vehicle crashes.

MeSH terms

  • Accidents, Traffic / classification*
  • Accidents, Traffic / statistics & numerical data*
  • Automobile Driving / statistics & numerical data*
  • Bayes Theorem
  • Decision Trees
  • Humans
  • Logistic Models
  • Michigan / epidemiology
  • Motor Vehicles / statistics & numerical data*
  • Reproducibility of Results
  • Safety / statistics & numerical data*
  • Severity of Illness Index*
  • Wounds and Injuries / epidemiology*