Empirical Mode Decomposition-Based Feature Extraction for Environmental Sound Classification

Ammar Ahmed; Youssef Serrestou; Kosai Raoof; Jean-François Diouris

doi:10.3390/s22207717

Sensors (Basel). 2022 Oct 11;22(20):7717. doi: 10.3390/s22207717.

Authors

Ammar Ahmed¹, Youssef Serrestou¹, Kosai Raoof¹, Jean-François Diouris²

Affiliations

¹ Laboratoire d'Acoustique de l'Université du Mans (LAUM), UMR 6613, Institut d'Acoustique-Graduate School (IA-GS), CNRS, Le Mans Université, 72085 Le Mans, France.
² CNRS, IETR UMR 6164, Université de Nantes, 85000 La Roche-sur-Yon, France.

Abstract

In environmental sound classification, log Mel band energies (MBEs) are considered the most successful and most commonly used features. The underlying algorithm, the fast Fourier transform (FFT), is valid only under certain restrictions. In this study, we address these limitations of the Fourier transform and propose a new method that extracts log Mel band energies using amplitude modulation and frequency modulation. We present a comparative study of the traditional log Mel band energy features extracted by the Fourier transform and the log Mel band energy features extracted by our new approach. The approach estimates the instantaneous frequency (IF) and instantaneous amplitude (IA) of the signal and uses them to construct a spectrogram. IA and IF are estimated by combining empirical mode decomposition (EMD) with the Teager-Kaiser energy operator (TKEO) and the discrete energy separation algorithm. A Mel filter bank is then applied to the estimated spectrogram to generate EMD-TKEO-based MBEs, or simply EMD-MBEs. In addition, we use EMD to remove signal trends from the original signal and generate another type of MBE, called S-MBEs, using the FFT and a Mel filter bank. Four different datasets were used, and convolutional neural networks (CNNs) were trained on Fourier transform-based MBEs (FFT-MBEs), EMD-MBEs, and S-MBEs. CNNs were also trained on an aggregation of all three feature types and on a combination of FFT-MBEs and EMD-MBEs. Individually, FFT-MBEs achieved higher accuracy than EMD-MBEs and S-MBEs. In general, the system trained on the combination of all three features performed slightly better than the systems trained on each of the three features separately.
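The IA/IF estimation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not say which variant of the discrete energy separation algorithm is used, so DESA-2 is assumed here. The EMD step is omitted, and a single synthetic tone stands in for one intrinsic mode function. The names `tkeo`, `desa2`, `fs`, `f0`, and `amp` are hypothetical and chosen only for illustration.

```python
import math


def tkeo(x):
    """Teager-Kaiser energy operator: psi[n] = x[n]^2 - x[n-1] * x[n+1].

    Returns interior samples only, so the result is two samples shorter than x.
    """
    return [x[n] ** 2 - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]


def desa2(x, fs):
    """Assumed DESA-2 variant: estimate instantaneous amplitude and frequency (Hz).

    x is expected to be mono-component (for example, one IMF produced by EMD).
    """
    # Symmetric difference y[n] = x[n+1] - x[n-1]; y[k] sits at x index k + 1.
    y = [x[n + 1] - x[n - 1] for n in range(1, len(x) - 1)]
    psi_x = tkeo(x)  # psi_x[i] sits at x index i + 1
    psi_y = tkeo(y)  # psi_y[j] sits at x index j + 2
    ia, inst_f = [], []
    for j, py in enumerate(psi_y):
        px = psi_x[j + 1]  # align both energies on x index j + 2
        if px <= 0.0 or py <= 0.0:
            # Energy operator output is not usable here; mark as undefined.
            ia.append(math.nan)
            inst_f.append(math.nan)
            continue
        # Clamp against rounding so acos stays in its domain.
        arg = max(-1.0, min(1.0, 1.0 - py / (2.0 * px)))
        omega = 0.5 * math.acos(arg)  # radians per sample
        inst_f.append(omega * fs / (2.0 * math.pi))
        ia.append(2.0 * px / math.sqrt(py))
    return ia, inst_f


# Synthetic stand-in for one IMF: a constant-amplitude, constant-frequency tone.
fs = 8000.0
f0, amp = 440.0, 0.5
tone = [amp * math.cos(2.0 * math.pi * f0 * n / fs + 0.3) for n in range(256)]
ia, inst_f = desa2(tone, fs)
print(ia[100], inst_f[100])
```

For a pure tone, the estimates should sit very close to the chosen amplitude and frequency at every interior sample. In a pipeline like the one the abstract outlines, the per-IMF IA and IF tracks would then be accumulated into a time-frequency spectrogram before a Mel filter bank is applied; that accumulation is not shown here.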

Keywords: acoustic signals; convolutional neural networks; empirical mode decomposition; environment sound classification; intrinsic mode function; signal processing; time–frequency representations.

MeSH terms

Algorithms*
Fourier Analysis
Neural Networks, Computer*
Signal Processing, Computer-Assisted
Sound

Grants and funding

This research was funded by the Projet régional Recherche Formation Innovation (RFI) WISE, under the CAPAHI project, funded by the Région des Pays de la Loire, France.