ResSKNet-SSDP: Effective and Light End-To-End Architecture for Speaker Recognition

Fei Deng; Lihong Deng; Peifan Jiang; Gexiang Zhang; Qiang Yang

doi:10.3390/s23031203

Sensors (Basel). 2023 Jan 20;23(3):1203. doi: 10.3390/s23031203.

Authors

Fei Deng¹, Lihong Deng¹, Peifan Jiang¹, Gexiang Zhang²,³, Qiang Yang³

Affiliations

¹ College of Computer Science and Cyber Security (Oxford Brookes College), Chengdu University of Technology, Chengdu 610059, China.
² Artificial Intelligence Research Center, Chengdu University of Technology, Chengdu 610059, China.
³ School of Control Engineering, Chengdu University of Information Engineering, Chengdu 610059, China.

Abstract

In speaker recognition tasks, convolutional neural network (CNN)-based approaches have shown significant success. Modeling long-term contexts and efficiently aggregating information are two challenges in speaker recognition, and both have a critical impact on system performance. Previous research has addressed these issues by introducing deeper, wider, and more complex network architectures and aggregation methods. However, these approaches struggle to improve performance significantly because they do not fully utilize global information, channel information, and time-frequency information. To address these issues, we propose a lighter and more efficient CNN-based end-to-end speaker recognition architecture, ResSKNet-SSDP. ResSKNet-SSDP consists of a residual selective kernel network (ResSKNet) and self-attentive standard deviation pooling (SSDP). ResSKNet can capture long-term contexts, neighboring information, and global information, thus extracting more informative frame-level features. SSDP can capture short- and long-term changes in frame-level features, aggregating the variable-length frame-level features into fixed-length, more distinctive utterance-level features. Extensive comparison experiments against current state-of-the-art speaker recognition systems were performed on two popular public speaker recognition datasets, VoxCeleb and CN-Celeb, where ResSKNet-SSDP achieved the lowest equal error rate (EER) and detection cost function (DCF) values, with EER/DCF of 2.33%/0.2298, 2.44%/0.2559, 4.10%/0.3502, and 12.28%/0.5051. Compared with the lightest x-vector, ResSKNet-SSDP has 3.1 M fewer parameters and 31.6 ms less inference time, yet performs 35.1% better. The results show that ResSKNet-SSDP significantly outperforms the current state-of-the-art speaker recognition architectures on all test sets and is an end-to-end architecture with fewer parameters and higher efficiency for applications in realistic situations. The ablation experiments further show that our proposed approaches also provide significant improvements over previous methods.
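
The aggregation step described above, turning a variable number of frame-level vectors into one fixed-length utterance-level vector, can be illustrated with a minimal sketch of attention-weighted standard deviation pooling. This is not the authors' SSDP implementation: the abstract does not give its formulation, so the single-layer tanh attention, the parameter names `w` and `v`, the variance floor `eps`, and the choice to return only the weighted standard deviation are all assumptions made for illustration.

```python
import math
import random


def attentive_std_pooling(frames, w, v, eps=1e-8):
    """Pool T frame vectors (each of length D) into one length-D vector.

    frames: list of T lists of D floats (frame-level features).
    w (D x H matrix) and v (length-H vector) are hypothetical attention
    parameters that would normally be learned; they are not from the paper.
    """
    # One scalar attention score per frame: v . tanh(frame @ w)
    scores = []
    for x in frames:
        hidden = [math.tanh(sum(x[d] * w[d][h] for d in range(len(x))))
                  for h in range(len(v))]
        scores.append(sum(vh * hh for vh, hh in zip(v, hidden)))

    # Softmax over time, shifted by the max score for numerical stability
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]

    # Attention-weighted standard deviation for each feature dimension
    pooled = []
    for d in range(len(frames[0])):
        mean = sum(a * x[d] for a, x in zip(alpha, frames))
        var = sum(a * (x[d] - mean) ** 2 for a, x in zip(alpha, frames))
        pooled.append(math.sqrt(max(var, eps)))  # floor avoids sqrt(0)
    return pooled


# Usage: utterances of different lengths map to vectors of the same length D.
random.seed(0)
D, H = 4, 3
w = [[random.gauss(0, 1) for _ in range(H)] for _ in range(D)]
v = [random.gauss(0, 1) for _ in range(H)]
short_utt = [[random.gauss(0, 1) for _ in range(D)] for _ in range(50)]
long_utt = [[random.gauss(0, 1) for _ in range(D)] for _ in range(120)]
short_vec = attentive_std_pooling(short_utt, w, v)
long_vec = attentive_std_pooling(long_utt, w, v)
```

Because the softmax weights sum to one over the time axis, the output length depends only on the feature dimension and not on the number of frames, which is the property the abstract attributes to SSDP.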

Keywords: aggregation model; end-to-end; selective kernel convolution; speaker recognition.

MeSH terms

Algorithms*
Attention
Neural Networks, Computer*

Grants and funding

This work is supported by the National Natural Science Foundation of China [grant number 61972324] and the Sichuan Science and Technology Program [grant numbers 2021YFS0313 and 2021YFG0133].