WToE: Learning When to Explore in Multiagent Reinforcement Learning

IEEE Trans Cybern. 2023 Nov 21:PP. doi: 10.1109/TCYB.2023.3328732. Online ahead of print.

Abstract

Existing work on multiagent exploration focuses on how to explore in fully cooperative tasks, which is insufficient in environments whose nonstationarity is induced by agent interactions. To tackle this issue, we propose When to Explore (WToE), a simple yet effective variational exploration method that learns when to explore in nonstationary environments. WToE employs an interaction-oriented adaptive exploration mechanism to adapt to environmental changes. We first propose a novel graphical model that uses a latent random variable to capture the step-level environmental change resulting from interaction effects. Leveraging this graphical model, we employ a supervised variational autoencoder (VAE) framework to derive a short-term inferred policy from historical trajectories to cope with the nonstationarity. Finally, agents engage in exploration when the short-term inferred policy diverges from the current actor policy. The proposed approach theoretically guarantees the convergence of the Q-value function. In our experiments, we validate the exploration mechanism on grid-world examples, multiagent particle environments, and the battle game of the MAgent environment. The results demonstrate the superiority of WToE over multiple baselines and existing exploration methods, such as MAEXQ, NoisyNets, EITI, and PR2.
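
The abstract's trigger rule (explore when the short-term inferred policy diverges from the actor policy) can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes discrete actions, softmax policies, KL divergence as the divergence measure, and a hand-tuned threshold; the names `wtoe_act` and `threshold` are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Convert a logit vector to a probability distribution."""
    z = np.exp(logits - logits.max())
    return z / z.sum()

def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) between two discrete distributions."""
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def wtoe_act(actor_logits, inferred_logits, threshold=0.1, rng=None):
    """Hypothetical WToE-style action selection: explore when the
    short-term inferred policy (e.g., decoded by the VAE from recent
    trajectories) diverges from the current actor policy."""
    rng = rng or np.random.default_rng()
    pi_actor = softmax(actor_logits)
    pi_inferred = softmax(inferred_logits)
    if kl_divergence(pi_inferred, pi_actor) > threshold:
        # Divergence signals an interaction-induced environmental
        # change: take an exploratory action from the inferred policy.
        return rng.choice(len(pi_inferred), p=pi_inferred), True
    # Policies agree: follow the actor policy as usual.
    return rng.choice(len(pi_actor), p=pi_actor), False

# Usage: the inferred policy disagrees with the actor, so the agent explores.
action, explored = wtoe_act(
    actor_logits=np.array([2.0, 0.1, 0.1]),
    inferred_logits=np.array([0.1, 2.0, 0.1]),
    rng=np.random.default_rng(0))
```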