Bidirectional Attention for Text-Dependent Speaker Verification

Xin Fang; Tian Gao; Liang Zou; Zhenhua Ling

doi:10.3390/s20236784

Sensors (Basel). 2020 Nov 27;20(23):6784. doi: 10.3390/s20236784.

Authors

Xin Fang¹ ², Tian Gao¹, Liang Zou³ ⁴, Zhenhua Ling¹

Affiliations

¹ School of Information Science and Technology, University of Science and Technology of China, Hefei 230022, China.
² iFLYTEK Research, iFLYTEK Co., Ltd., Hefei 230088, China.
³ School of Information and Electrical Control Engineering, China University of Mining and Technology, Xuzhou 221116, China.
⁴ School of Electronics and Information Engineering, Anhui University, Hefei 236601, China.

Abstract

Automatic speaker verification offers a flexible and effective means of biometric authentication. Previous deep learning-based methods have shown promising results, but several problems remain open. In prior work on speaker-discriminative neural networks, the representation of the target speaker is treated as fixed when it is compared with utterances from different speakers, and the joint information between enrollment and evaluation utterances is ignored. In this paper, we propose to combine CNN-based feature learning with a bidirectional attention mechanism to achieve better performance with only one enrollment utterance. The evaluation-enrollment joint information is exploited to provide interactive features through bidirectional attention. In addition, we introduce a separate cost function to identify the phonetic content, which helps compute the attention scores more precisely. These interactive features are complementary to the constant ones, which are extracted from each speaker separately and do not vary with the evaluation utterance. The proposed method achieved a competitive equal error rate of 6.26% on the internal "DAN DAN NI HAO" benchmark dataset with 1250 utterances and outperformed various baseline methods, including the traditional i-vector/PLDA, d-vector, self-attention, and sequence-to-sequence attention models.
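As an illustration of what "interactive features through bidirectional attention" can mean, the following is a minimal NumPy sketch, not the authors' implementation. It assumes frame-level features have already been extracted (for example by a CNN) and uses scaled dot-product scores, mean pooling, and cosine scoring, none of which are specified in the abstract. The function names `bidirectional_attention` and `cosine` and the feature sizes are hypothetical.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def bidirectional_attention(enroll, evaluation):
    """Hypothetical sketch: attend in both directions between two utterances.

    enroll:     (T_e, D) frame-level features of the enrollment utterance
    evaluation: (T_v, D) frame-level features of the evaluation utterance
    Returns one pooled D-dimensional interactive vector per direction.
    """
    d = enroll.shape[1]
    # Frame-to-frame similarity; scaled dot product is an assumption here.
    scores = enroll @ evaluation.T / np.sqrt(d)  # (T_e, T_v)

    # Direction 1: every enrollment frame attends over the evaluation frames.
    enroll_to_eval = softmax(scores, axis=1) @ evaluation  # (T_e, D)
    # Direction 2: every evaluation frame attends over the enrollment frames.
    eval_to_enroll = softmax(scores, axis=0).T @ enroll  # (T_v, D)

    # Mean pooling over time gives utterance-level interactive vectors.
    return enroll_to_eval.mean(axis=0), eval_to_enroll.mean(axis=0)


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Random stand-ins for CNN outputs: 50 and 40 frames of 64-dim features.
rng = np.random.default_rng(0)
enroll = rng.standard_normal((50, 64))
evaluation = rng.standard_normal((40, 64))

e_vec, v_vec = bidirectional_attention(enroll, evaluation)
score = cosine(e_vec, v_vec)
```

Because the attention weights depend on both inputs, the enrollment-side vector changes whenever the evaluation utterance changes, which is the contrast the abstract draws with fixed ("constant") speaker representations.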

Keywords: CNN; bidirectional attention; interactive representation; text-dependent speaker verification.

MeSH terms

Algorithms
Biometric Identification*
Neural Networks, Computer*
