Guided Policy Exploration for Markov Decision Processes Using an Uncertainty-Based Value-of-Information Criterion

IEEE Trans Neural Netw Learn Syst. 2018 Jun;29(6):2080-2098. doi: 10.1109/TNNLS.2018.2812709.

Abstract

Reinforcement learning in environments with many state-action pairs is challenging. The issue is the number of episodes needed to search the policy space thoroughly. Most conventional heuristics address this search problem in a stochastic manner, which can leave large portions of the policy space unvisited during the early stages of training. In this paper, we propose an uncertainty-based, information-theoretic approach for performing guided stochastic searches that cover the policy space more effectively. Our approach is based on the value of information, a criterion that provides the optimal trade-off between expected costs and the granularity of the search process. The value of information yields a stochastic routine for choosing actions during learning that explores the policy space in a coarse-to-fine manner. We augment this criterion with a state-transition uncertainty factor, which guides the search into previously unexplored regions of the policy space. We evaluate the uncertainty-based value-of-information policies on the games Centipede and Crossy Road. Our results indicate that our approach yields better-performing policies in fewer episodes than purely stochastic exploration strategies. We also show that the training rate of our approach can be improved further by using the policy cross-entropy to guide the selection of our criterion's hyperparameters.
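The abstract does not include code, but the kind of exploration rule it describes can be illustrated with a small sketch. The snippet below is a simplified, hypothetical rendering rather than the authors' implementation: the value-of-information trade-off between expected cost and search granularity is approximated by a soft-max over action values whose temperature-like hyperparameter controls how coarse or fine the search is, and a state-transition uncertainty bonus derived from visit counts pushes the agent toward rarely tried actions. All names and parameter choices here (voi_exploration_policy, tau, beta) are illustrative assumptions.

```python
import numpy as np

def voi_exploration_policy(q_values, visit_counts, tau=1.0, beta=0.5, rng=None):
    """Sketch of an uncertainty-augmented, value-of-information-style action choice.

    q_values:     estimated action values for the current state
    visit_counts: state-action visit counts for the current state
    tau:          granularity hyperparameter (large tau -> near-uniform, coarse search;
                  small tau -> near-greedy, fine search)
    beta:         weight of the state-transition uncertainty bonus
    """
    rng = np.random.default_rng() if rng is None else rng

    # Uncertainty bonus: unvisited or rarely visited actions receive a boost,
    # steering the search toward unexplored regions of the policy space.
    uncertainty = 1.0 / np.sqrt(np.asarray(visit_counts, dtype=float) + 1.0)

    # Soft-max (Gibbs) weighting stands in for the cost/granularity trade-off:
    # annealing tau toward zero moves the search from coarse to fine.
    scores = (np.asarray(q_values, dtype=float) + beta * uncertainty) / tau
    scores -= scores.max()                       # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()

    return rng.integers(0, 1) * 0 + rng.choice(len(probs), p=probs), probs


# Minimal usage example with made-up numbers.
q = np.array([0.2, 0.5, 0.1, 0.4])
counts = np.array([10.0, 50.0, 0.0, 5.0])
action, probs = voi_exploration_policy(q, counts, tau=0.5, beta=0.3)
print(action, probs.round(3))
```

In the paper's framing, the coarse-to-fine behavior comes from adjusting the value-of-information hyperparameter over training, and the policy cross-entropy is used to guide that adjustment; the fixed tau in this sketch is only a placeholder for that schedule.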

Publication types

  • Research Support, U.S. Gov't, Non-P.H.S.
  • Research Support, Non-U.S. Gov't