PVGAN: A Pathological Voice Generation Model Incorporating a Progressive Nesting Strategy

Xiaoying Pan; Tong Feng; Nijuan Zhang

doi:10.1016/j.jvoice.2023.10.006

J Voice. 2023 Nov 6:S0892-1997(23)00315-6. doi: 10.1016/j.jvoice.2023.10.006. Online ahead of print.

Authors

Xiaoying Pan¹, Tong Feng², Nijuan Zhang²

Affiliations

¹ Shaanxi Key Laboratory of Network Data Analysis and Intelligent Processing, Xi'an University of Posts and Telecommunications, Xi'an 710121, China; School of Computer Science & Technology, Xi'an University of Posts and Telecommunications, Xi'an 710121, China. Electronic address: panxiaoying@xupt.edu.cn.
² Shaanxi Key Laboratory of Network Data Analysis and Intelligent Processing, Xi'an University of Posts and Telecommunications, Xi'an 710121, China; School of Computer Science & Technology, Xi'an University of Posts and Telecommunications, Xi'an 710121, China.

PMID: 37940422

Abstract

The voice generation task addresses the problem of limited samples in voice datasets. Increasing the number of samples can improve the accuracy of voice disorder diagnosis, which has broad application value in medical diagnosis and other fields. At present, existing models do not adequately capture detailed features of pathological voice data, such as pitch, timbre, and different frequency components. This paper therefore proposes PVGAN, a network that learns the different frequency information of audio to generate pathological voice data. The proposed network captures the multi-scale features and different periodic patterns of audio signals through multiscale perceptual residual blocks and periodic discriminators. A progressive nesting strategy combines the generator and the discriminator to improve the learning of information at different resolutions. In addition, a latent mapping network fuses the latent vector with the condition information to generate sound features related to specific diseases or pathological states. The loss function is also optimized to further improve model performance. On the Saarbruecken Voice Database (SVD), when the model is trained with different pathological types as conditional information, the average values of each index for the generated data are similar to those of the original data. Finally, the generated data were used to expand the SVD dataset, and the accuracy of the two classification experiments improved to a certain extent.
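
The abstract does not say how the periodic discriminators expose periodic patterns. One common approach in GAN vocoders, assumed here purely for illustration and not confirmed as the authors' design, is to fold the 1-D waveform into rows of a fixed period so that samples spaced one period apart line up in the same column. The function name `fold_by_period`, the zero padding, and the period values below are all hypothetical.

```python
def fold_by_period(samples, period, pad_value=0.0):
    """Fold a 1-D signal into rows of length `period`.

    After folding, column j holds samples j, j + period, j + 2 * period, ...
    so a 2-D discriminator could compare samples that are one period apart.
    Zero padding is an assumption made here; other padding modes are possible.
    """
    if period <= 0:
        raise ValueError("period must be positive")
    padded = list(samples)
    remainder = len(padded) % period
    if remainder:
        # Pad the tail so the length is an exact multiple of the period.
        padded.extend([pad_value] * (period - remainder))
    return [padded[i:i + period] for i in range(0, len(padded), period)]


# Hypothetical periods; the values used by PVGAN are not given in the abstract.
PERIODS = (2, 3, 5)

signal = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0]
views = {p: fold_by_period(signal, p) for p in PERIODS}
```

Each entry of `views` is a separate 2-D view of the same signal, and in a setup of this kind each view would be passed to its own sub-discriminator.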

Keywords: Deep learning; Generative adversarial networks; Voice analysis; Voice generation.