Sampling Efficient Deep Reinforcement Learning Through Preference-Guided Stochastic Exploration

IEEE Trans Neural Netw Learn Syst. 2023 Oct 3:PP. doi: 10.1109/TNNLS.2023.3317628. Online ahead of print.

Abstract

Stochastic exploration is key to the success of the deep Q-network (DQN) algorithm. However, most existing stochastic exploration approaches either explore actions heuristically, regardless of their Q values, or couple the sampling with Q values, which inevitably introduces bias into the learning process. In this article, we propose a novel preference-guided ϵ-greedy exploration algorithm that efficiently facilitates exploration for DQN without introducing additional bias. Specifically, we design a dual architecture consisting of two branches, one of which is a copy of DQN, namely, the Q branch. The other branch, which we call the preference branch, learns the action preference that the DQN implicitly follows. We theoretically prove that the policy improvement theorem holds for the preference-guided ϵ-greedy policy and experimentally show that the inferred action preference distribution aligns with the landscape of the corresponding Q values. Intuitively, preference-guided ϵ-greedy exploration motivates the DQN agent to take diverse actions: actions with larger Q values are sampled more frequently, while those with smaller Q values still have a chance to be explored, thereby encouraging exploration. We comprehensively evaluate the proposed method by benchmarking it against well-known DQN variants in nine different environments. Extensive results confirm the superiority of our proposed method in terms of performance and convergence speed.
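
For illustration only (this is not the authors' published implementation), the following minimal Python sketch shows one plausible reading of preference-guided ϵ-greedy action selection: with probability 1 - ϵ the agent exploits the greedy action from the Q branch, and with probability ϵ it explores by sampling from the learned preference distribution instead of uniformly. The function name, its inputs, and the softmax stand-in for the preference branch are all hypothetical.

import numpy as np

def preference_guided_epsilon_greedy(q_values, preference_probs, epsilon, rng=None):
    # q_values: array of Q(s, a) from the Q branch for the current state.
    # preference_probs: action distribution from the preference branch (sums to 1).
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        # Explore: sample in proportion to the learned preference, so actions with
        # larger Q values are drawn more often while low-Q actions keep a nonzero
        # probability of being tried (unlike uniform ϵ-greedy exploration).
        return int(rng.choice(len(q_values), p=preference_probs))
    # Exploit: greedy action from the Q branch, as in standard ϵ-greedy.
    return int(np.argmax(q_values))

# Toy usage with made-up numbers; a softmax over Q stands in for the preference branch.
q = np.array([1.0, 2.5, 0.3])
pref = np.exp(q) / np.exp(q).sum()
action = preference_guided_epsilon_greedy(q, pref, epsilon=0.1)

In this sketch the exploration distribution is shaped by the preference branch rather than fixed and uniform, which is the intuitive behavior the abstract describes; the actual architecture and training of the preference branch are detailed in the paper itself.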