A Multidomain Generative Adversarial Network for Hoarse-to-Normal Voice Conversion

Minghang Chu; Jing Wang; Zhiwei Fan; Mengtao Yang; Chao Xu; Yaoyao Ma; Zhi Tao; Di Wu

doi:10.1016/j.jvoice.2023.08.027

J Voice. 2023 Oct 14:S0892-1997(23)00274-6. doi: 10.1016/j.jvoice.2023.08.027. Online ahead of print.

Authors

Minghang Chu¹, Jing Wang¹, Zhiwei Fan¹, Mengtao Yang¹, Chao Xu¹, Yaoyao Ma¹, Zhi Tao¹, Di Wu²

Affiliations

¹ School of Optoelectronic Science and Engineering, Soochow University, Suzhou, Jiangsu, China.
² School of Optoelectronic Science and Engineering, Soochow University, Suzhou, Jiangsu, China. Electronic address: wudi@suda.edu.cn.

PMID: 37845148

Abstract

Hoarseness reduces the efficiency of spoken communication. However, surgical treatment may leave patients with poorer voice quality, and voice repair techniques can only repair vowels. In this paper, we propose a novel multidomain generative adversarial voice conversion method to achieve hoarse-to-normal voice conversion and personalize voices for patients with hoarseness. The proposed method aims to improve the speech quality of hoarse voices through a multidomain generative adversarial network. It is evaluated with both subjective and objective metrics. Spectrum analysis shows that the proposed method converts hoarse voice formants more effectively than variational auto-encoder (VAE), Auto-VC (voice conversion), StarGAN-VC (generative adversarial network voice conversion), and CycleVAE. For the word error rate, the proposed method obtains absolute gains of 35.62, 37.97, 45.42, and 50.05 compared to CycleVAE, StarGAN-VC, Auto-VC, and VAE, respectively. In terms of naturalness, the proposed method outperforms CycleVAE, VAE, StarGAN-VC, and Auto-VC by 42.49%, 51.60%, 69.37%, and 77.54%, respectively. In terms of intelligibility, it outperforms VAE, CycleVAE, StarGAN-VC, and Auto-VC with absolute gains of 0.87, 0.93, 1.08, and 1.13, respectively. In terms of content similarity, the proposed method obtains improvements of 43.48%, 75.52%, 76.21%, and 108.62% compared to CycleVAE, StarGAN-VC, Auto-VC, and VAE, respectively. ABX results show that the proposed method can personalize the voice for patients with hoarseness. This study demonstrates the feasibility of voice conversion methods in improving the speech quality of hoarse voices.

Keywords: Artificial intelligence; Health sciences; Hoarse voice conversion; Intelligibility; Multidomain generative adversarial network; Pathological voice.