Efficient Reinforcement Learning via Probabilistic Trajectory Optimization

IEEE Trans Neural Netw Learn Syst. 2018 Nov;29(11):5459-5474. doi: 10.1109/TNNLS.2017.2764499. Epub 2018 Mar 5.

Abstract

We present a trajectory optimization approach to reinforcement learning in continuous state and action spaces, called probabilistic differential dynamic programming (PDDP). Our method represents the system dynamics using Gaussian processes (GPs) and performs local dynamic programming iteratively around a nominal trajectory in Gaussian belief spaces. Unlike model-based policy search methods, PDDP does not require a policy parameterization and learns a time-varying control policy via successive forward-backward sweeps. A convergence analysis of the iterative scheme is given, showing that the algorithm converges globally to a stationary point under certain conditions. We show that prior model knowledge can be incorporated into the proposed framework to speed up learning, and that a generalized optimization criterion based on the predicted cost distribution can be employed to enable risk-sensitive learning. We demonstrate the effectiveness and efficiency of the proposed algorithm on nontrivial tasks. Compared with a state-of-the-art GP-based policy search method, PDDP offers a superior combination of learning speed, data efficiency, and applicability.
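To make the abstract's description concrete, the sketch below illustrates the general idea of GP-model-based trajectory optimization with forward-backward sweeps: fit a Gaussian process to one-step dynamics from random rollouts, then iterate an iLQR/DDP-style backward pass and forward rollout around the nominal trajectory. This is not the paper's algorithm: it is a simplified sketch that linearizes only the GP predictive mean (rather than propagating full Gaussian beliefs), uses a hypothetical 1-D point-mass task and quadratic cost chosen here for illustration, and relies on scikit-learn's GaussianProcessRegressor.

    # Minimal illustrative sketch (assumptions: toy point-mass task, GP mean only,
    # no belief-space propagation as in the actual PDDP algorithm).
    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF, WhiteKernel

    rng = np.random.default_rng(0)
    dt, T = 0.1, 30                      # time step and horizon
    nx, nu = 2, 1                        # state = [position, velocity], control = force

    def true_dynamics(x, u):             # unknown to the learner; used only to gather data
        pos, vel = x
        return np.array([pos + dt * vel, vel + dt * (u[0] - 0.1 * vel)])

    # 1. Fit a GP model of the one-step dynamics from random rollouts
    X_train, Y_train, x = [], [], np.zeros(nx)
    for _ in range(200):
        u = rng.uniform(-1.0, 1.0, nu)
        x_next = true_dynamics(x, u)
        X_train.append(np.concatenate([x, u]))
        Y_train.append(x_next - x)       # learn the state increment
        x = x_next if np.all(np.abs(x_next) < 5) else np.zeros(nx)
    gp = GaussianProcessRegressor(RBF(1.0) + WhiteKernel(1e-4), normalize_y=True)
    gp.fit(np.array(X_train), np.array(Y_train))

    def gp_step(x, u):                   # predicted next state (GP mean only)
        return x + gp.predict(np.concatenate([x, u])[None, :])[0]

    def linearize(x, u, eps=1e-4):       # finite-difference Jacobians of the GP mean
        A, B, f0 = np.zeros((nx, nx)), np.zeros((nx, nu)), gp_step(x, u)
        for i in range(nx):
            dx = np.zeros(nx); dx[i] = eps
            A[:, i] = (gp_step(x + dx, u) - f0) / eps
        for i in range(nu):
            du = np.zeros(nu); du[i] = eps
            B[:, i] = (gp_step(x, u + du) - f0) / eps
        return A, B

    # 2. Iterative forward-backward sweeps (iLQR on the learned mean dynamics)
    x_goal = np.array([1.0, 0.0])
    Q, R, Qf = np.diag([1.0, 0.1]), 0.01 * np.eye(nu), np.diag([50.0, 5.0])
    us = [np.zeros(nu) for _ in range(T)]
    for it in range(15):
        # forward pass: roll out the current controls through the GP model
        xs = [np.zeros(nx)]
        for t in range(T):
            xs.append(gp_step(xs[-1], us[t]))
        # backward pass: time-varying affine feedback from a Riccati-like recursion
        Vx, Vxx = Qf @ (xs[T] - x_goal), Qf.copy()
        ks, Ks = [None] * T, [None] * T
        for t in reversed(range(T)):
            A, B = linearize(xs[t], us[t])
            Qx  = Q @ (xs[t] - x_goal) + A.T @ Vx
            Qu  = R @ us[t] + B.T @ Vx
            Qxx = Q + A.T @ Vxx @ A
            Quu = R + B.T @ Vxx @ B
            Qux = B.T @ Vxx @ A
            ks[t] = -np.linalg.solve(Quu, Qu)
            Ks[t] = -np.linalg.solve(Quu, Qux)
            Vx  = Qx + Ks[t].T @ Quu @ ks[t] + Ks[t].T @ Qu + Qux.T @ ks[t]
            Vxx = Qxx + Ks[t].T @ Quu @ Ks[t] + Ks[t].T @ Qux + Qux.T @ Ks[t]
        # apply the updated time-varying affine policy (damped step for stability)
        x_new, us_new = np.zeros(nx), []
        for t in range(T):
            u = us[t] + 0.5 * ks[t] + Ks[t] @ (x_new - xs[t])
            us_new.append(u)
            x_new = gp_step(x_new, u)
        us = us_new

    print("final state under learned policy:", x_new, "goal:", x_goal)

The resulting time-varying feedback gains (ks, Ks) correspond to the kind of local control policy the abstract refers to; incorporating the GP predictive variance into the cost and propagating beliefs, as PDDP does, would replace the mean-only rollout and linearization above.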

Publication types

  • Research Support, Non-U.S. Gov't
  • Research Support, U.S. Gov't, Non-P.H.S.