GLGAN-VC: A Guided Loss-Based Generative Adversarial Network for Many-to-Many Voice Conversion

IEEE Trans Neural Netw Learn Syst. 2023 Dec 4:PP. doi: 10.1109/TNNLS.2023.3335119. Online ahead of print.

Abstract

Many-to-many voice conversion (VC) is a technique that learns mappings between the speech features of multiple speakers during training, transferring the vocal characteristics of a source speaker to a target speaker while leaving the linguistic content of the source speech unchanged. Existing research highlights a notable gap in naturalness between original and generated speech samples in many-to-many VC, leaving substantial room for improvement in both parallel and nonparallel VC scenarios. In this study, we introduce a generative adversarial network (GAN) with a guided loss (GLGAN-VC) designed to enhance many-to-many VC through architectural improvements and the integration of alternative loss functions. Our approach includes a pair-wise downsampling and upsampling (PDU) generator network for effective speech feature mapping (FM) in multidomain VC. In addition, we incorporate an FM loss to preserve content information and a residual connection (RC)-based discriminator network to improve learning. A guided loss (GL) function is introduced to efficiently capture differences in latent feature representations between source and target speakers, and an enhanced reconstruction loss is proposed for better preservation of contextual information. We evaluate our model on several datasets, including VCC 2016, VCC 2018, VCC 2020, and an emotional speech dataset (ESD). Based on both subjective and objective evaluation metrics, our results demonstrate that GLGAN-VC outperforms state-of-the-art (SOTA) many-to-many GAN-based VC models in terms of speech quality and speaker similarity of the generated speech samples.
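The abstract names four loss components (adversarial, FM, guided, and reconstruction) but does not give their exact formulations. The PyTorch-style sketch below illustrates how such a composite generator objective might be assembled. The loss weights (lambda_*), the choice of L1 distances, and the interfaces of the hypothetical gen and disc modules are all assumptions made for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

# Illustrative sketch only: weights, L1 distances, and the gen/disc
# interfaces are assumptions; the paper's exact losses may differ.

def guided_loss(z_src, z_tgt):
    # GL: penalize the distance between latent feature representations
    # of source and target speakers (L1 assumed here).
    return F.l1_loss(z_src, z_tgt)

def feature_mapping_loss(feats_real, feats_fake):
    # FM loss: match intermediate discriminator features of real and
    # converted speech to help preserve content information.
    return sum(F.l1_loss(fr, ff) for fr, ff in zip(feats_real, feats_fake))

def generator_objective(gen, disc, x_src, x_tgt, c_src, c_tgt,
                        lambda_fm=1.0, lambda_gl=1.0, lambda_rec=10.0):
    # gen(x, c) -> (converted features, latent code); disc(x, c) ->
    # (realness score, list of intermediate features). Both hypothetical.
    x_fake, z_src = gen(x_src, c_tgt)        # convert source toward target domain
    _, z_tgt = gen(x_tgt, c_tgt)             # latent code of real target speech
    score_fake, feats_fake = disc(x_fake, c_tgt)
    _, feats_real = disc(x_tgt, c_tgt)

    adv = -score_fake.mean()                 # one common adversarial formulation
    fm = feature_mapping_loss(feats_real, feats_fake)
    gl = guided_loss(z_src, z_tgt)
    x_cyc, _ = gen(x_fake, c_src)            # reconstruction (cycle) pass
    rec = F.l1_loss(x_cyc, x_src)

    return adv + lambda_fm * fm + lambda_gl * gl + lambda_rec * rec
```

In this reading, the GL term pulls the generator's latent representations of source and target utterances together, while the FM and reconstruction terms anchor content and context; the relative weights would in practice be tuned per dataset.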