The Actor-Dueling-Critic Method for Reinforcement Learning

Menghao Wu; Yanbin Gao; Alexander Jung; Qiang Zhang; Shitong Du

doi:10.3390/s19071547

Sensors (Basel). 2019 Mar 30;19(7):1547. doi: 10.3390/s19071547.

Authors

Menghao Wu¹ ², Yanbin Gao³, Alexander Jung⁴, Qiang Zhang⁵, Shitong Du⁶

Affiliations

¹ College of Automation, Harbin Engineering University, Harbin 150001, China. wumenghao@hrbeu.edu.cn.
² Department of Computer Science, Aalto University, 02150 Espoo, Finland. wumenghao@hrbeu.edu.cn.
³ College of Automation, Harbin Engineering University, Harbin 150001, China. gaoyanbin@hrbeu.edu.cn.
⁴ Department of Computer Science, Aalto University, 02150 Espoo, Finland. alexander.jung@aalto.fi.
⁵ College of Automation, Harbin Engineering University, Harbin 150001, China. 18846425693@hrbeu.edu.cn.
⁶ College of Automation, Harbin Engineering University, Harbin 150001, China. dushitong@hrbeu.edu.cn.

Abstract

Model-free reinforcement learning is a powerful and efficient machine-learning paradigm that has been widely used in robotic control. In the reinforcement learning setting, value-function methods learn policies by maximizing the state-action value (Q value), but they suffer from inaccurate Q estimation, which results in poor performance in stochastic environments. To mitigate this issue, we present an approach based on the actor-critic framework. In the critic branch, we modify how the Q value is estimated by introducing an advantage function, as in the dueling network, which estimates the action-advantage value. Because the action-advantage value is independent of the state and of environment noise, we use it as a fine-tuning factor for the estimated Q value. We refer to this approach as the actor-dueling-critic (ADC) network, since the framework is inspired by the dueling network. Furthermore, we redesign the dueling network part of the critic branch to adapt it to continuous action spaces. The method was tested on Gym classic control environments and an obstacle avoidance environment, and we designed a noisy environment to test training stability. The results indicate that the ADC approach is more stable and converges faster than the DDPG method in noisy environments.

Keywords: DDPG; advantage; continuous control; dueling network; reinforcement learning.
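
The critic structure described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the class name DuelingCritic, the layer sizes, the tanh activations, the random initialisation, and the combination rule Q = V(s) + A(a) are all hypothetical choices. The advantage branch takes only the action as input, which is one possible reading of the statement that the action-advantage value is independent of the state; the architecture in the paper may differ. The sketch covers the forward pass only and uses the Python standard library.

```python
import math
import random


def linear(weights, bias, x):
    """Affine map: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def init_layer(rng, n_in, n_out):
    """Small random weights and zero biases (illustrative initialisation)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


class DuelingCritic:
    """Hypothetical critic with separate state-value and advantage branches."""

    def __init__(self, state_dim, action_dim, hidden=16, seed=0):
        rng = random.Random(seed)
        # State-value branch V(s): sees the state only.
        self.v1 = init_layer(rng, state_dim, hidden)
        self.v2 = init_layer(rng, hidden, 1)
        # Advantage branch A(a): sees the continuous action only (assumption).
        self.a1 = init_layer(rng, action_dim, hidden)
        self.a2 = init_layer(rng, hidden, 1)

    def value(self, state):
        hidden = [math.tanh(z) for z in linear(*self.v1, list(state))]
        return linear(*self.v2, hidden)[0]

    def advantage(self, action):
        hidden = [math.tanh(z) for z in linear(*self.a1, list(action))]
        return linear(*self.a2, hidden)[0]

    def q_value(self, state, action):
        # Assumed combination rule: the advantage fine-tunes the state value.
        return self.value(state) + self.advantage(action)


if __name__ == "__main__":
    critic = DuelingCritic(state_dim=3, action_dim=1)
    print(critic.q_value([0.1, -0.2, 0.3], [0.5]))
```

In an actor-critic loop, an actor network would propose the continuous action and a critic of this kind would score it. Training (target networks, replay buffer, gradient updates) is omitted here.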

MeSH terms

Algorithms*
Deep Learning
Markov Chains
Robotics
