Short-time speaker verification with different speaking style utterances

PLoS One. 2020 Nov 11;15(11):e0241809. doi: 10.1371/journal.pone.0241809. eCollection 2020.

Abstract

In recent years, great progress has been made in the technical aspects of automatic speaker verification (ASV). However, the broader deployment of ASV technology remains challenging, because most systems are still very sensitive to new, unknown, and spoofing conditions. Most previous studies focused on extracting target speaker information from natural speech. This paper aims to design a new ASV corpus with multiple speaking styles and to investigate ASV robustness to these different speaking styles. We first release this corpus on the Zenodo website for public research; for each speaker it contains several text-dependent and text-independent singing, humming, and normal reading speech utterances. Then, we investigate the speaker discrimination of each speaking style in the feature space. Furthermore, the intra- and inter-speaker variabilities within each speaking style and across speaking styles are investigated in both text-dependent and text-independent ASV tasks. The conventional Gaussian Mixture Model (GMM) and the state-of-the-art x-vector are used to build ASV systems. Experimental results show that the voiceprint information in humming and singing speech is more distinguishable than that in normal reading speech for conventional ASV systems. Furthermore, we find that combining the three speaking styles significantly improves the x-vector based ASV system, whereas only limited gains are obtained by the conventional GMM-based systems.
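To make the "conventional GMM-based" baseline mentioned above concrete, the following is a minimal illustrative sketch of GMM speaker enrollment and log-likelihood scoring. It is not the authors' implementation: the MFCC settings, number of mixture components, decision threshold, and file names are assumptions for illustration only.

# Minimal sketch of a conventional GMM-based speaker verification scorer.
# NOT the paper's implementation; parameters and file names are illustrative.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_features(wav_path, n_mfcc=20):
    """Load an utterance and return frame-level MFCCs (n_frames x n_mfcc)."""
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.T

def enroll_speaker(enroll_wavs, n_components=64):
    """Fit a GMM on the pooled enrollment features of one target speaker."""
    feats = np.vstack([mfcc_features(p) for p in enroll_wavs])
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag",
                          max_iter=200, random_state=0)
    gmm.fit(feats)
    return gmm

def verify(gmm, test_wav, threshold=-45.0):
    """Score a test utterance by its mean frame log-likelihood under the GMM."""
    score = gmm.score(mfcc_features(test_wav))
    return score, score > threshold

# Usage with hypothetical file names, e.g. enrollment on singing and humming
# utterances of one speaker, then scoring an unseen trial utterance:
# gmm = enroll_speaker(["spk001_sing_01.wav", "spk001_hum_01.wav"])
# score, accepted = verify(gmm, "trial_utt.wav")

In practice, such a baseline is usually extended with a universal background model (GMM-UBM) and score normalization; the sketch only illustrates the basic likelihood-based decision that the paper's conventional systems rely on.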

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Humans
  • Normal Distribution
  • Speech
  • Speech Acoustics*
  • Speech Perception

Grants and funding

This work was funded by Projects 61701306 and 62071302 of the National Natural Science Foundation of China. The funder provided support in the form of salaries for authors L. Wei, Y. Liu, and Y. Shi, but did not have any additional role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript. Unisound AI Technology Co., Ltd. provided support in the form of a salary for author Y. Li. The specific roles of these authors are articulated in the ‘author contributions’ section.