Speech Emotion Recognition with Heterogeneous Feature Unification of Deep Neural Network

Wei Jiang; Zheng Wang; Jesse S Jin; Xianfeng Han; Chunguang Li

doi:10.3390/s19122730

Sensors (Basel). 2019 Jun 18;19(12):2730. doi: 10.3390/s19122730.

Authors

Wei Jiang¹ ², Zheng Wang³, Jesse S Jin⁴, Xianfeng Han⁵, Chunguang Li⁶

Affiliations

¹ College of Intelligence and Computing, Tianjin University, Tianjin 300072, China. jiangweitju@163.com.
² School of Computer Information and Engineering, Changzhou Institute of Technology, Changzhou 213032, China. jiangweitju@163.com.
³ College of Intelligence and Computing, Tianjin University, Tianjin 300072, China. wzheng@tju.edu.cn.
⁴ College of Intelligence and Computing, Tianjin University, Tianjin 300072, China. jinsheng@tju.edu.cn.
⁵ College of Intelligence and Computing, Tianjin University, Tianjin 300072, China. hanxianf@163.com.
⁶ School of Computer Information and Engineering, Changzhou Institute of Technology, Changzhou 213032, China. licg@czu.cn.

Abstract

Automatic speech emotion recognition is challenging because of the gap between acoustic features and human emotions, and its performance relies strongly on the discriminative acoustic features extracted for a given recognition task. In this work, we propose a novel deep neural architecture that extracts informative feature representations from heterogeneous acoustic feature groups, which may contain redundant and unrelated information that lowers emotion recognition performance. After the informative features are obtained, a fusion network is trained to jointly learn the discriminative acoustic feature representation, and a Support Vector Machine (SVM) is used as the final classifier for the recognition task. Experimental results on the IEMOCAP dataset demonstrate that the proposed architecture improves recognition performance over existing state-of-the-art approaches, achieving an accuracy of 64%.

Keywords: deep neural architecture; fusion network; heterogeneous feature unification; human–computer interaction (HCI); speech emotion recognition.
