Polyphonic training set synthesis improves self-supervised urban sound classification

Félix Gontier; Vincent Lostanlen; Mathieu Lagrange; Nicolas Fortin; Catherine Lavandier; Jean-François Petiot

doi:10.1121/10.0005277

J Acoust Soc Am. 2021 Jun;149(6):4309. doi: 10.1121/10.0005277.

Authors

Félix Gontier¹, Vincent Lostanlen¹, Mathieu Lagrange¹, Nicolas Fortin², Catherine Lavandier³, Jean-François Petiot⁴

Affiliations

¹ CNRS, LS2N, F-44322 Nantes, France.
² Unité Mixte de Recherche en Acoustique Environnementale, Université Gustave Eiffel, Centre d'Etudes et d'Expertise sur les Risques, l'Environnement, la Mobilité et l'Aménagement, F-44344 Bouguenais, France.
³ CY Cergy Paris Université École Nationale Supérieure de l'électronique et de ses Applications (ENSEA), CNRS, ETIS, F-95000 Cergy, France.
⁴ École Centrale de Nantes, LS2N, F-44322 Nantes, France.

PMID: 34241459

Abstract

Machine listening systems for environmental acoustic monitoring face a shortage of expert annotations to be used as training data. To circumvent this issue, the emerging paradigm of self-supervised learning proposes to pre-train audio classifiers on a task whose ground truth is trivially available. Alternatively, training set synthesis consists in annotating a small corpus of acoustic events of interest, which are then automatically mixed at random to form a larger corpus of polyphonic scenes. Prior studies have considered these two paradigms in isolation but rarely in conjunction. Furthermore, the impact of data curation in training set synthesis remains unclear. To fill this gap in research, this article proposes a two-stage approach. In the self-supervised stage, we formulate a pretext task (Audio2Vec skip-gram inpainting) on unlabeled spectrograms from an acoustic sensor network. Then, in the supervised stage, we formulate a downstream task of multilabel urban sound classification on synthetic scenes. We find that training set synthesis benefits overall performance more than self-supervised learning does. Interestingly, the geographical origin of the acoustic events in training set synthesis appears to have a decisive impact.
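The idea of training set synthesis described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `synthesize_scene`, the uniform onset and gain sampling, the number of events per scene, and the toy noise "events" are all assumptions made for illustration. The sketch only shows how a small set of labeled isolated events can be mixed at random into a polyphonic scene whose multilabel target is known by construction.

```python
import numpy as np


def synthesize_scene(events, labels, n_classes, scene_len, max_events, rng):
    """Mix randomly chosen isolated events into one polyphonic scene.

    Hypothetical helper: sampling choices below are illustrative assumptions.
    Returns the mixed waveform and a multi-hot vector of the classes present.
    """
    scene = np.zeros(scene_len, dtype=np.float64)
    target = np.zeros(n_classes, dtype=np.int64)
    n_events = rng.integers(1, max_events + 1)  # at least one event per scene
    for _ in range(n_events):
        idx = rng.integers(len(events))
        event = events[idx][:scene_len]          # truncate overly long events
        # Onset chosen so the event fits entirely inside the scene.
        onset = rng.integers(0, scene_len - len(event) + 1)
        gain = rng.uniform(0.1, 1.0)             # assumed gain range
        scene[onset:onset + len(event)] += gain * event  # events may overlap
        target[labels[idx]] = 1                  # label known by construction
    return scene, target


# Toy corpus: three noise bursts standing in for annotated events.
rng = np.random.default_rng(0)
sr = 8000  # assumed sample rate in Hz
events = [
    rng.standard_normal(sr // 2),
    rng.standard_normal(sr),
    rng.standard_normal(sr // 4),
]
labels = [0, 1, 2]  # one class index per event
scene, target = synthesize_scene(
    events, labels, n_classes=3, scene_len=4 * sr, max_events=5, rng=rng
)
```

Because every scene is assembled from annotated events, the multilabel ground truth comes for free, which is what allows a small annotated corpus to yield a much larger training set.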

MeSH terms

Acoustics*
Sound*