An Analysis of the Value of Information When Exploring Stochastic, Discrete Multi-Armed Bandits

Entropy (Basel). 2018 Feb 28;20(3):155. doi: 10.3390/e20030155.

Abstract

In this paper, we propose an information-theoretic exploration strategy for stochastic, discrete multi-armed bandits that achieves optimal regret. Our strategy is based on the value of information criterion, which measures the trade-off between policy information and obtainable rewards. High amounts of policy information are associated with exploration-dominant searches of the arm space and yield high rewards. Low amounts of policy information favor the exploitation of existing knowledge. Information, in this criterion, is quantified by a parameter that can be varied during the search. We demonstrate that a simulated-annealing-like update of this parameter, with a sufficiently fast cooling schedule, leads to regret that grows logarithmically in the number of arm pulls; logarithmic growth matches the Lai-Robbins lower bound for stochastic bandits, which is why this rate is optimal.
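To make the annealing idea concrete, the sketch below implements one plausible reading of such a strategy: arms are drawn from a Gibbs (soft-max) distribution over empirical mean rewards, and an exploration parameter is cooled as pulls accumulate, shifting the policy from exploration toward exploitation. The function name voi_bandit, the uniform prior over arms, the tau0 / log(t + 2) cooling schedule, and the pull interface are illustrative assumptions; they are not the exact criterion or schedule derived in the paper.

```python
import math
import random

def voi_bandit(pull, n_arms, horizon, tau0=1.0):
    """Anneal a soft-max exploration policy over a discrete bandit.

    pull(a) must return a stochastic reward for arm a. The Gibbs form,
    uniform prior, and logarithmic cooling schedule are assumptions
    made for illustration, not the paper's exact update.
    """
    counts = [0] * n_arms
    means = [0.0] * n_arms

    # Pull each arm once so every empirical mean is defined.
    for a in range(n_arms):
        means[a] = pull(a)
        counts[a] = 1

    for t in range(n_arms, horizon):
        # Cooling schedule: the exploration parameter tau shrinks with t,
        # moving the policy from exploration-dominant toward exploitation.
        tau = tau0 / math.log(t + 2)

        # Gibbs (soft-max) distribution over empirical mean rewards;
        # subtracting the max keeps the exponentials numerically stable.
        m = max(means)
        weights = [math.exp((mu - m) / tau) for mu in means]
        total = sum(weights)
        probs = [w / total for w in weights]

        a = random.choices(range(n_arms), weights=probs)[0]
        reward = pull(a)
        counts[a] += 1
        means[a] += (reward - means[a]) / counts[a]  # incremental mean

    return means, counts

if __name__ == "__main__":
    # Example: three Bernoulli arms; as tau cools, the policy should
    # concentrate its pulls on the 0.8 arm.
    arms = [0.2, 0.5, 0.8]
    means, counts = voi_bandit(lambda a: float(random.random() < arms[a]),
                               n_arms=3, horizon=5000)
    print(means, counts)
```

A logarithmic schedule is used here because the abstract calls for a "sufficiently fast" cooling of the exploration parameter; any schedule driving tau to zero would trade exploration for exploitation over time, but the precise rate needed for the logarithmic-regret guarantee is established in the paper itself.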

Keywords: exploitation; exploration; exploration-exploitation dilemma; information theory; multi-armed bandits; reinforcement learning.