Evaluating deep learning architectures for Speech Emotion Recognition

Haytham M Fayek; Margaret Lech; Lawrence Cavedon

doi:10.1016/j.neunet.2017.02.013

Neural Netw. 2017 Aug:92:60-68. doi: 10.1016/j.neunet.2017.02.013. Epub 2017 Mar 21.

Authors

Haytham M Fayek¹, Margaret Lech², Lawrence Cavedon³

Affiliations

¹ School of Engineering, RMIT University, Melbourne VIC 3001, Australia. Electronic address: haytham.fayek@ieee.org.
² School of Engineering, RMIT University, Melbourne VIC 3001, Australia. Electronic address: margaret.lech@rmit.edu.au.
³ School of Science, RMIT University, Melbourne VIC 3001, Australia. Electronic address: lawrence.cavedon@rmit.edu.au.

PMID: 28396068

Abstract

Speech Emotion Recognition (SER) can be regarded as a static or dynamic classification problem, which makes SER an excellent test bed for investigating and comparing various deep learning architectures. We describe a frame-based formulation of SER that relies on minimal speech processing and end-to-end deep learning to model intra-utterance dynamics. We use the proposed SER system to empirically explore feed-forward and recurrent neural network architectures and their variants. The experiments illuminate the advantages and limitations of these architectures in paralinguistic speech recognition, and in emotion recognition in particular. As a result of this exploration, we report state-of-the-art results on the IEMOCAP database for speaker-independent SER and present quantitative and qualitative assessments of the models' performance.
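To make the frame-based formulation more concrete, the following is a minimal sketch of one way such a pipeline could be organised: each frame is concatenated with a few neighbouring frames, a classifier produces a posterior over emotion classes for every frame, and the frame posteriors are pooled into a single utterance-level decision. Everything specific here is an assumption made for illustration and is not taken from the abstract: the function names, the context size of two frames on each side, the single linear-softmax layer standing in for a deep network, the feature dimensionality, the four classes, and mean pooling as the aggregation rule.

```python
import numpy as np


def stack_context(features, left=2, right=2):
    """Concatenate each frame with its neighbours.

    features: array of shape (num_frames, feature_dim).
    Edge frames are padded by repeating the first/last frame (an assumption).
    Returns an array of shape (num_frames, feature_dim * (left + right + 1)).
    """
    num_frames = features.shape[0]
    padded = np.pad(features, ((left, right), (0, 0)), mode="edge")
    # Slice i holds, for every frame t, the frame at offset (i - left) from t.
    windows = [padded[i:i + num_frames] for i in range(left + right + 1)]
    return np.concatenate(windows, axis=1)


def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def classify_utterance(features, weights, bias, left=2, right=2):
    """Score every frame, then average frame posteriors over the utterance.

    The linear layer is a hypothetical stand-in for a trained neural network.
    """
    inputs = stack_context(features, left, right)
    frame_posteriors = softmax(inputs @ weights + bias)
    utterance_posterior = frame_posteriors.mean(axis=0)
    return int(utterance_posterior.argmax()), utterance_posterior


# Toy usage with random, untrained parameters (illustrative values only).
rng = np.random.default_rng(0)
num_frames, feature_dim, num_classes = 50, 40, 4
utterance = rng.normal(size=(num_frames, feature_dim))
weights = rng.normal(size=(feature_dim * 5, num_classes)) * 0.01
bias = np.zeros(num_classes)
label, posterior = classify_utterance(utterance, weights, bias)
```

Because the parameters are random, the predicted label is meaningless; the sketch only shows how per-frame predictions can be turned into one prediction per utterance. A recurrent model would replace the fixed context window with a hidden state carried across frames.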

Keywords: Affective computing; Deep learning; Emotion recognition; Neural networks; Speech recognition.

MeSH terms

Emotions*
Machine Learning*
Neural Networks, Computer
Speech Recognition Software*