An integrated data- and theory-driven crash severity model

Dongjie Liu; Dawei Li; N N Sze; Hongliang Ding; Yuchen Song

doi:10.1016/j.aap.2023.107282

Accid Anal Prev. 2023 Dec:193:107282. doi: 10.1016/j.aap.2023.107282. Epub 2023 Sep 16.

Authors

Dongjie Liu¹, Dawei Li², N N Sze³, Hongliang Ding⁴, Yuchen Song¹

Affiliations

¹ School of Transportation, Southeast University, Nanjing, Jiangsu 211189, China.
² School of Transportation, Southeast University, Nanjing, Jiangsu 211189, China; Jiangsu Key Laboratory of Urban ITS, Southeast University, Nanjing, Jiangsu 211189, China; Jiangsu Province Collaborative Innovation Center of Modern Urban Traffic Technologies, Nanjing, Jiangsu 211189, China. Electronic address: lidawei@seu.edu.cn.
³ Department of Civil and Environmental Engineering, The Hong Kong Polytechnic University, Hong Kong SAR, China.
⁴ Department of Civil and Environmental Engineering, The Hong Kong Polytechnic University, Hong Kong SAR, China; Institute of Smart City and Intelligent Transportation, Institute of Urban Rail Transportation, Southwest Jiaotong University, Chengdu, Sichuan 611756, China.

PMID: 37722256
DOI: 10.1016/j.aap.2023.107282

Abstract

For crash severity modeling, researchers typically view theory-driven models and data-driven models as different or even conflicting approaches. The reason is that data-driven (machine-learning) models offer good predictability but weak interpretability, while theory-driven (econometric) models offer robust interpretability but only moderate predictability. To alleviate the tension between them, this study proposes an integrated data- and theory-driven crash severity model, the Embedded Fusion model based on Text Vector Representations (TVR-EF), which leverages the complementary strengths of both. The model specification consists of two parts. (i) The data-driven component not only mitigates a deficiency of traditional econometric models, in which the frequently used one-hot encoding makes it impossible to observe semantic relatedness between variable categories, but also enhances the interpretability of the relationship between crash severity and potential influencing factors through the learned embedding weight matrix. (ii) In the theory-driven component, the multinomial logit model is implemented as a 2D-Convolutional Neural Network (2D-CNN) to increase flexibility and decrease dependency on prior knowledge for different crash severity outcomes. A crash dataset from Guangdong Province, China, is used to estimate the TVR-EF model, which is then benchmarked against two traditional econometric models and three widely used machine-learning models. Results indicate that the TVR-EF model not only improves predictive performance but is also easier to interpret.
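
The two ideas the abstract relies on can be illustrated with a minimal sketch. It is not the authors' TVR-EF implementation and does not reproduce the 2D-CNN formulation. It only shows (a) why one-hot vectors cannot express relatedness between categories while an embedding matrix can, and (b) how multinomial logit probabilities follow from utilities. The category names, the three severity outcomes, and all numeric weights below are made-up assumptions for illustration; in a trained model the embedding weights and coefficients would be estimated from data.

```python
import math

# Hypothetical categories of one crash attribute (illustrative only).
CATEGORIES = ["clear", "rain", "fog"]

# Made-up 2-dimensional embedding weights, one row per category.
# In a trained model these values would be learned, not hand-set.
EMBEDDING = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.9]]

# Made-up intercepts and coefficients for three assumed severity outcomes;
# the first outcome is the base alternative with utility fixed at zero.
INTERCEPTS = [0.0, -0.5, -1.5]
COEFFICIENTS = [[0.0, 0.0], [0.4, 0.6], [0.2, 1.0]]


def one_hot(index, size):
    """Return a one-hot vector of length `size` with a 1 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def row_times_matrix(vec, matrix):
    """Multiply a row vector by a matrix given as a list of rows."""
    return [sum(vec[i] * matrix[i][j] for i in range(len(vec)))
            for j in range(len(matrix[0]))]


def utilities(features, intercepts, coefficients):
    """Linear utility for each outcome: intercept + coefficients . features."""
    return [b + sum(w * x for w, x in zip(ws, features))
            for b, ws in zip(intercepts, coefficients)]


def mnl_probabilities(utils):
    """Multinomial logit choice probabilities (softmax over utilities)."""
    shift = max(utils)  # subtract the maximum for numerical stability
    exps = [math.exp(u - shift) for u in utils]
    total = sum(exps)
    return [e / total for e in exps]


# Distinct one-hot vectors are orthogonal, so their similarity is always zero.
sim_one_hot = cosine(one_hot(1, 3), one_hot(2, 3))

# An embedding lookup is the one-hot vector multiplied by the weight matrix.
looked_up = row_times_matrix(one_hot(1, 3), EMBEDDING)

# With these made-up weights, "rain" and "fog" sit closer than "clear" and "rain".
sim_rain_fog = cosine(EMBEDDING[1], EMBEDDING[2])
sim_clear_rain = cosine(EMBEDDING[0], EMBEDDING[1])

# Outcome probabilities for a crash recorded under "fog".
probs = mnl_probabilities(utilities(EMBEDDING[2], INTERCEPTS, COEFFICIENTS))
```

Because the embedding rows are ordinary real vectors, distances between them can be inspected after training, which is the sense in which a learned embedding weight matrix can support interpretation.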

Keywords: Crash severity; Data- and theory-driven model; Embedding representations; Interpretable machine learning; Logit model.