An off-policy multi-agent stochastic policy gradient algorithm for cooperative continuous control

Neural Netw. 2024 Feb;170:610-621. doi: 10.1016/j.neunet.2023.11.046. Epub 2023 Nov 23.

Abstract

Multi-agent reinforcement learning (MARL) algorithms based on trust regions (TR) have achieved significant success in numerous cooperative multi-agent tasks. These algorithms constrain the Kullback-Leibler (KL) divergence (i.e., the TR constraint) between the current and new policies to avoid overly aggressive update steps and improve learning performance. However, the majority of existing TR-based MARL algorithms are on-policy, meaning that they require new data sampled by the current policies for training and cannot utilize off-policy (or historical) data, leading to low sample efficiency. This study aims to enhance the data efficiency of TR-based learning methods. To achieve this, an approximation of the original objective function is designed. In addition, it is proven that as long as the update size of the policy (measured by the KL divergence) is restricted, optimizing the designed objective function using historical data guarantees monotonic improvement of the original target. Building on the designed objective, a practical off-policy multi-agent stochastic policy gradient algorithm is proposed within the framework of centralized training with decentralized execution (CTDE). Additionally, policy entropy is integrated into the reward to promote exploration and, consequently, improve stability. Comprehensive experiments are conducted on the representative multi-agent MuJoCo (MAMuJoCo) benchmark, which offers a range of challenging cooperative continuous multi-agent control tasks. The results demonstrate that the proposed algorithm outperforms existing algorithms by a significant margin.
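To make the idea concrete, the following is a minimal, illustrative sketch (not the paper's reference implementation) of the kind of per-agent loss the abstract describes: an importance-weighted surrogate built from off-policy data, a KL penalty that restricts the update size relative to the behavior policy, and an entropy bonus folded into the objective. All specifics here are assumptions for illustration, including the Gaussian policy, the coefficient names `kl_coef` and `ent_coef`, and the advantage input, which under CTDE would come from a centralized critic (not shown).

```python
import torch
import torch.nn as nn
from torch.distributions import Normal, kl_divergence


class GaussianActor(nn.Module):
    """Decentralized stochastic policy for one agent (CTDE: acts on local observations)."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.mu = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(), nn.Linear(hidden, act_dim)
        )
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def dist(self, obs: torch.Tensor) -> Normal:
        return Normal(self.mu(obs), self.log_std.exp())


def surrogate_loss(actor, obs, act, behavior_logp, behavior_dist, advantage,
                   kl_coef=1.0, ent_coef=0.01):
    """Off-policy surrogate: IS-weighted advantage - KL penalty + entropy bonus.

    `advantage` is assumed to be produced by a centralized critic, and
    `behavior_logp` / `behavior_dist` describe the (older) policy that collected
    the data, so the same batch can be reused for several updates.
    """
    dist = actor.dist(obs)
    logp = dist.log_prob(act).sum(-1)
    ratio = (logp - behavior_logp).exp()             # importance-sampling correction
    kl = kl_divergence(behavior_dist, dist).sum(-1)  # restricts the policy update size
    entropy = dist.entropy().sum(-1)                 # encourages exploration
    # Maximize the IS-weighted advantage while penalizing large KL steps.
    return -(ratio * advantage - kl_coef * kl + ent_coef * entropy).mean()
```

The KL penalty is what connects this sketch to the monotonic-improvement argument summarized above: as long as each agent's update stays close to the data-collecting policy, the off-policy surrogate remains a reliable proxy for the true objective.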

Keywords: Deep reinforcement learning (DRL); Multi-agent MuJoCo; Multi-agent control; Multi-agent reinforcement learning (MARL); Trust region.

MeSH terms

  • Algorithms*
  • Benchmarking
  • Entropy
  • Learning*
  • Policy