CVaR-Constrained Policy Optimization for Safe Reinforcement Learning

IEEE Trans Neural Netw Learn Syst. 2024 Feb 23:PP. doi: 10.1109/TNNLS.2023.3331304. Online ahead of print.

Abstract

Current constrained reinforcement learning (RL) methods guarantee constraint satisfaction only in expectation, which is inadequate for safety-critical decision problems. Since a constraint satisfied in expectation still leaves a high probability of exceeding the cost threshold, solving constrained RL problems with high-probability constraint satisfaction is critical for RL safety. In this work, we formulate the safety criterion as a constraint on the conditional value-at-risk (CVaR) of cumulative costs and propose the CVaR-constrained policy optimization algorithm (CVaR-CPO), which maximizes the expected return while ensuring that agents attend to the upper tail of constraint costs. Based on a bound on the CVaR-related performance difference between two policies, we first reformulate the CVaR-constrained problem in an augmented state space using a state-extension procedure and the trust-region method. CVaR-CPO then derives the optimal policy update by applying the Lagrangian method to the constrained optimization problem. In addition, CVaR-CPO exploits the distribution of constraint costs to provide an efficient quantile-based estimate of the CVaR-related value function. Experiments on constrained control tasks show that the proposed method produces behaviors that satisfy safety constraints while achieving performance comparable to most safe RL (SRL) methods.
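
For context, the abstract does not state the formal objective; the following is a standard formulation consistent with its description, with the confidence level \alpha, cumulative cost C, and threshold d introduced here for illustration rather than taken from the paper. The first display is the usual Rockafellar–Uryasev form of CVaR; the second is one common way to write the CVaR-constrained problem and the Lagrangian relaxation the abstract refers to (sign and level conventions vary across papers).

% Rockafellar--Uryasev form of CVaR for a cumulative cost C at
% confidence level \alpha (notation introduced here for illustration):
\[
\operatorname{CVaR}_{\alpha}(C)
  \;=\; \min_{\nu \in \mathbb{R}}
        \left\{ \nu + \tfrac{1}{1-\alpha}\,
        \mathbb{E}\!\left[(C - \nu)_{+}\right] \right\}
  \;=\; \mathbb{E}\!\left[\,C \mid C \ge \operatorname{VaR}_{\alpha}(C)\,\right]
  \quad \text{(for continuous } C\text{)}.
\]
% One common statement of the CVaR-constrained RL problem and its
% Lagrangian relaxation (threshold d assumed here):
\[
\max_{\pi}\; \mathbb{E}_{\pi}[R]
  \quad \text{s.t.} \quad
  \operatorname{CVaR}_{\alpha}\!\left(C_{\pi}\right) \le d,
\qquad
\max_{\pi}\,\min_{\lambda \ge 0}\;
  \mathbb{E}_{\pi}[R] \;-\; \lambda\!\left(\operatorname{CVaR}_{\alpha}(C_{\pi}) - d\right).
\]

Under this reading, the quantile-based estimate mentioned in the abstract would amount to estimating VaR_\alpha (the \alpha-quantile of the cost distribution) and the conditional tail expectation from sampled costs, though the paper's exact estimator is not given here.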