VAG: A Uniform Model for Cross-Modal Visual-Audio Mutual Generation

IEEE Trans Neural Netw Learn Syst. 2022 Apr 8:PP. doi: 10.1109/TNNLS.2022.3161314. Online ahead of print.

Abstract

Considering both the audio and visual modalities is helpful for understanding a video. Under harsh environmental interference or signal packet loss, however, automatically compensating for a missing audio or visual stream is a challenging task. We propose a dynamic cross-modal visual-audio mutual generation model (VAG) that covers four generation paths: audio-to-visual conversion, visual-to-audio conversion, audio self-generation, and visual self-generation. VAG jointly optimizes modal reconstruction and adversarial constraints, effectively addressing the problems of structural alignment and signal compensation in incomplete videos. We conducted instrument-oriented and pose-oriented cross-modal audio-visual mutual generation experiments on the Sub-University of Rochester Musical Performance (Sub-URMP) dataset to verify the effectiveness of the model.
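The four generation paths described above can be sketched as a shared-latent encoder-decoder scheme trained with a combined loss. The sketch below is a toy NumPy illustration of that structure, not the paper's implementation: the dimensions, the linear encoders/decoders, and the latent-alignment stand-in for the adversarial term are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions (not from the paper).
D_A, D_V, D_Z = 8, 8, 4

# Linear "encoders" and "decoders" for each modality (toy stand-ins
# for the model's learned networks).
enc_a, enc_v = rng.normal(size=(D_Z, D_A)), rng.normal(size=(D_Z, D_V))
dec_a, dec_v = rng.normal(size=(D_A, D_Z)), rng.normal(size=(D_V, D_Z))

def mse(x, y):
    """Mean-squared reconstruction error."""
    return float(np.mean((x - y) ** 2))

def vag_losses(a, v):
    """Four generation paths through a shared latent space:
    audio->visual, visual->audio, audio->audio, visual->visual."""
    za, zv = enc_a @ a, enc_v @ v
    recon = {
        "a2v": mse(dec_v @ za, v),  # audio-to-visual conversion
        "v2a": mse(dec_a @ zv, a),  # visual-to-audio conversion
        "a2a": mse(dec_a @ za, a),  # audio self-generation
        "v2v": mse(dec_v @ zv, v),  # visual self-generation
    }
    # In the paper the adversarial constraint comes from discriminators;
    # here a latent-alignment penalty is a crude placeholder for it.
    adv = mse(za, zv)
    total = sum(recon.values()) + adv
    return recon, adv, total

a = rng.normal(size=D_A)  # audio feature vector (synthetic)
v = rng.normal(size=D_V)  # visual feature vector (synthetic)
recon, adv, total = vag_losses(a, v)
print(sorted(recon))  # ['a2a', 'a2v', 'v2a', 'v2v']
```

Jointly minimizing all four reconstruction terms plus the adversarial constraint is what lets either modality be regenerated when the other is corrupted or lost.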