Frequency, Time, Representation and Modeling Aspects for Major Speech and Audio Processing Applications

Sensors (Basel). 2022 Aug 22;22(16):6304. doi: 10.3390/s22166304.

Abstract

The number of speech and audio processing applications is large and growing. They cover a wide range of tasks, each placing different requirements on the processed speech or audio signals and, therefore, indirectly on the audio sensors as well. This article reports on tests evaluating the effect of basic physical properties of speech and audio signals on the recognition accuracy of major speech/audio processing applications, i.e., speech recognition, speaker recognition, speech emotion recognition, and audio event recognition. Particular focus is placed on frequency ranges, time intervals, the precision of representation (quantization), and the complexity of models suitable for each class of applications. Using domain-specific datasets, suitable feature extraction methods, and complex neural network models, the effect of these basic signal properties on the achieved accuracy was tested and evaluated for each group of applications. The tests confirmed that the basic parameters do affect overall performance and that this effect is domain-dependent. Accurate knowledge of the extent of these effects can therefore help system designers select appropriate hardware, sensors, architectures, and software for a particular application, especially when resources are limited.
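The abstract's point that the precision of representation matters can be made concrete with a minimal sketch (not taken from the paper): uniform quantization of a test tone at different bit depths and the resulting signal-to-quantization-noise ratio, which roughly follows the well-known ~6 dB-per-bit rule. All names and parameters below are illustrative assumptions.

```python
import math

def quantize(samples, bits):
    """Uniformly quantize samples in [-1, 1] to the given bit depth."""
    levels = 2 ** (bits - 1)  # signed representation, e.g. 32768 levels at 16 bits
    return [round(s * (levels - 1)) / (levels - 1) for s in samples]

def snr_db(original, quantized):
    """Signal-to-quantization-noise ratio in dB."""
    signal_power = sum(s * s for s in original)
    noise_power = sum((s - q) ** 2 for s, q in zip(original, quantized))
    return 10 * math.log10(signal_power / noise_power)

# 1 kHz test tone sampled at 16 kHz, a common rate for speech front ends
tone = [math.sin(2 * math.pi * 1000 * n / 16000) for n in range(16000)]
for bits in (8, 12, 16):
    print(f"{bits}-bit SNR: {snr_db(tone, quantize(tone, bits)):.1f} dB")
```

Each additional bit of precision adds roughly 6 dB of SNR; whether a given recognition task actually benefits from that headroom is exactly the kind of domain-dependent question the article examines.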

Keywords: audio event recognition; convolutional neural networks; speaker recognition; speech emotions; speech features; speech recognition.

MeSH terms

  • Emotions
  • Neural Networks, Computer*
  • Software
  • Speech*