Frequency, Time, Representation and Modeling Aspects for Major Speech and Audio Processing Applications

Sensors (Basel). 2022 Aug 22;22(16):6304. doi: 10.3390/s22166304.

Abstract

The number of speech and audio processing applications is large and growing. They cover a wide range of tasks, each placing different requirements on the processed speech or audio signals and, therefore, indirectly on the audio sensors as well. This article reports on tests evaluating the effect of basic physical properties of speech and audio signals on the recognition accuracy of major speech/audio processing applications, i.e., speech recognition, speaker recognition, speech emotion recognition, and audio event recognition. Particular focus is placed on frequency ranges, time intervals, the precision of representation (quantization), and the complexity of models suitable for each class of applications. Using domain-specific datasets, suitable feature extraction methods, and complex neural network models, the effect of these basic signal properties on the achieved accuracy was tested and evaluated for each group of applications. The tests confirmed that the basic parameters do affect overall performance and that this effect is domain-dependent. Accurate knowledge of the extent of these effects can therefore help system designers select appropriate hardware, sensors, architectures, and software for a particular application, especially when resources are limited.
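The abstract's point that the precision of representation matters can be made concrete with a minimal sketch (not taken from the paper): uniform quantization of a test tone at different bit depths and the resulting signal-to-quantization-noise ratio, which roughly follows the well-known ~6 dB-per-bit rule. All names and parameters below are illustrative assumptions.

```python
import math

def quantize(samples, bits):
    """Uniformly quantize samples in [-1, 1] to the given bit depth."""
    levels = 2 ** (bits - 1)  # signed representation, e.g. 32768 levels at 16 bits
    return [round(s * (levels - 1)) / (levels - 1) for s in samples]

def snr_db(original, quantized):
    """Signal-to-quantization-noise ratio in dB."""
    signal_power = sum(s * s for s in original)
    noise_power = sum((s - q) ** 2 for s, q in zip(original, quantized))
    return 10 * math.log10(signal_power / noise_power)

# 1 kHz test tone sampled at 16 kHz, a common rate for speech front ends
tone = [math.sin(2 * math.pi * 1000 * n / 16000) for n in range(16000)]
for bits in (8, 12, 16):
    print(f"{bits}-bit SNR: {snr_db(tone, quantize(tone, bits)):.1f} dB")
```

Each additional bit of precision adds roughly 6 dB of SNR; whether a given recognition task actually benefits from that headroom is exactly the kind of domain-dependent question the article examines.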

Keywords: audio event recognition; convolutional neural networks; speaker recognition; speech emotions; speech features; speech recognition.

MeSH terms

  • Emotions
  • Neural Networks, Computer*
  • Software
  • Speech*