Scaling Up Q-Learning via Exploiting State-Action Equivalence

Yunlian Lyu; Aymeric Côme; Yijie Zhang; Mohammad Sadegh Talebi

doi:10.3390/e25040584

Entropy (Basel). 2023 Mar 29;25(4):584. doi: 10.3390/e25040584.

Authors

Yunlian Lyu¹,², Aymeric Côme³, Yijie Zhang¹, Mohammad Sadegh Talebi¹

Affiliations

¹ Department of Computer Science, University of Copenhagen, Universitetsparken 1, 2100 Copenhagen, Denmark.
² School of Computer Science and Engineering, University of Electronic Science and Technology of China, Xiyuan Ave., Chengdu 611731, China.
³ Inria Rennes, Bretagne Atlantique Campus Universitaire de Beaulieu, Avenue du Général Leclerc, 35042 Rennes, France.

Abstract

Recent success stories in reinforcement learning have demonstrated that leveraging structural properties of the underlying environment is key to devising viable methods capable of solving complex tasks. We study off-policy learning in discounted reinforcement learning, where some equivalence relation exists in the environment. We introduce a new model-free algorithm, called QL-ES (Q-learning with equivalence structure), a variant of (asynchronous) Q-learning tailored to exploit the equivalence structure in the Markov decision process (MDP). We report a non-asymptotic PAC-type sample complexity bound for QL-ES, thereby establishing its sample efficiency. This bound also allows us to quantify the superiority of QL-ES over Q-learning analytically, showing that the theoretical gain in some domains can be massive. We report extensive numerical experiments demonstrating that QL-ES converges significantly faster than (structure-oblivious) Q-learning empirically. These results imply that the empirical performance gain obtained by exploiting the equivalence structure can be massive, even in simple domains. To the best of our knowledge, QL-ES is the first provably efficient model-free algorithm to exploit the equivalence structure in finite MDPs.
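
To illustrate the general idea of sharing a Q-learning update across equivalent state-action pairs, the following is a minimal Python sketch. It is not the QL-ES algorithm as specified in the paper. It assumes one possible formalization of equivalence (pairs in the same class share rewards and have transition distributions that match up to a permutation of next states), and it assumes this structure is known in advance. The function name equivalence_aware_update and the arguments sigma, classes, and class_of are hypothetical names introduced here for illustration only.

```python
import numpy as np

def equivalence_aware_update(Q, sigma, classes, class_of,
                             s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Apply one observed transition (s, a, r, s_next) to every pair
    in the equivalence class of (s, a).

    Assumptions (illustrative, not taken from the paper):
      Q        : array of shape (S, A), tabular action values.
      sigma    : int array of shape (S, A, S); sigma[s, a] is a permutation
                 of states that aligns next states across equivalent pairs.
      classes  : list of lists of (state, action) tuples, one per class.
      class_of : int array of shape (S, A) giving the class index of a pair.
    """
    # Position of the observed next state in the ordering of the visited pair.
    rank = int(np.where(sigma[s, a] == s_next)[0][0])

    # Compute all targets from the same snapshot of Q before writing.
    targets = []
    for (s2, a2) in classes[class_of[s, a]]:
        mapped_next = sigma[s2, a2, rank]  # corresponding next state for (s2, a2)
        targets.append(((s2, a2), r + gamma * Q[mapped_next].max()))

    # Standard Q-learning step, applied to each pair in the class.
    for (s2, a2), target in targets:
        Q[s2, a2] += alpha * (target - Q[s2, a2])
    return Q
```

In this sketch a single sample updates every pair in the class rather than only the visited pair, which is the intuition behind the faster convergence claimed for structure-aware methods. With singleton classes it reduces to an ordinary Q-learning update.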

Keywords: Markov decision process; Q-learning; equivalence structure; reinforcement learning.

Grants and funding

This research received no external funding.