Few-shot short utterance speaker verification using meta-learning

PeerJ Comput Sci. 2023 Apr 21;9:e1276. doi: 10.7717/peerj-cs.1276. eCollection 2023.

Abstract

Short utterance speaker verification (SV) in practical applications is the task of accepting or rejecting a speaker's identity claim based on a few enrollment utterances. Traditional methods use deep neural networks to extract speaker representations for verification. Recently, several meta-learning approaches have learned a deep distance metric to distinguish speakers within meta-tasks. Among them, a prototypical network learns a metric space in which speaker identity is classified by computing the distance to each speaker's prototype center. We use emphasized channel attention, propagation and aggregation in TDNN (ECAPA-TDNN) to implement the embedding function that the prototypical network requires, a nonlinear mapping from the input space to the metric space, for the few-shot SV task. However, optimizing only for the speakers in a given meta-task may not be sufficient to learn discriminative speaker features. We therefore use an episodic training strategy in which the classes of the support and query sets correspond to the classes of the entire training set, which further improves model performance. The proposed model outperforms the comparison models on the VoxCeleb1 dataset and has a wide range of practical applications.
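
The prototypical classification step described in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation. It assumes the embeddings have already been produced by some encoder (ECAPA-TDNN in the paper, hand-written toy vectors here), uses squared Euclidean distance as the metric, and leaves out the classification over the entire training set. The function names `prototypes`, `squared_distances`, and `prototypical_loss` are hypothetical.

```python
import numpy as np

def prototypes(support, labels, n_way):
    """Mean support embedding per speaker, shape (n_way, dim)."""
    return np.stack([support[labels == k].mean(axis=0) for k in range(n_way)])

def squared_distances(queries, protos):
    """Squared Euclidean distance from every query to every prototype."""
    diff = queries[:, None, :] - protos[None, :, :]
    return (diff ** 2).sum(axis=-1)

def prototypical_loss(queries, query_labels, protos):
    """Cross-entropy over a softmax of negative distances (assumed metric)."""
    logits = -squared_distances(queries, protos)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    loss = -log_probs[np.arange(len(query_labels)), query_labels].mean()
    preds = logits.argmax(axis=1)
    return loss, preds

# Toy 2-way, 2-shot episode with hand-written 2-D "embeddings" (illustrative only).
support = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]])
support_labels = np.array([0, 0, 1, 1])
queries = np.array([[0.0, 1.0], [10.0, 11.0]])
query_labels = np.array([0, 1])

protos = prototypes(support, support_labels, n_way=2)
loss, preds = prototypical_loss(queries, query_labels, protos)
```

In this toy episode each query sits exactly on its own speaker's prototype, so both are assigned to the correct speaker and the loss is close to zero. At verification time the same distance (or a similarity score) between an enrollment prototype and a test embedding would be compared against a threshold.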

Keywords: Episodic training strategy; Global classification; Meta-learning; Prototypical network; Speaker verification; Support set.

Grants and funding

This research work was supported by the National Science Foundation of China (No. 62166025); the Science and Technology project of Gansu Province (No. 21YF5GA073); and the Gansu Province Department of Education: Outstanding Graduate Student “Innovation Star” Project (No. 2021CXCX-512, 2021CXCX-511). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.