Policy Correction and State-Conditioned Action Evaluation for Few-Shot Lifelong Deep Reinforcement Learning

IEEE Trans Neural Netw Learn Syst. 2024 Apr 30:PP. doi: 10.1109/TNNLS.2024.3385570. Online ahead of print.

Abstract

Lifelong deep reinforcement learning (DRL) approaches are commonly employed to adapt continuously to new tasks without forgetting previously acquired knowledge. Although current lifelong DRL methods have shown promising progress in retaining acquired knowledge, they require substantial adaptation effort (i.e., longer training) and yield suboptimal policies when transferring to a new task that deviates significantly from previously learned tasks, a phenomenon known as the few-shot generalization challenge. In this work, we propose a generic approach that equips existing lifelong DRL methods with the capability of few-shot generalization. First, we employ selective experience reuse, leveraging the experience of previously encountered states to improve adaptation training on new tasks. Then, a relaxed softmax function is applied to the target Q-values to improve the accuracy of the evaluated Q-values, yielding policies closer to optimal. Finally, we measure and reduce the discrepancy between the data distribution of the current policy and that of the off-policy samples, improving adaptation efficiency. Extensive experiments on three typical benchmarks compare our approach with six representative lifelong DRL methods and two state-of-the-art (SOTA) few-shot DRL methods in terms of training speed, episode return, and average return across all episodes. The experimental results substantiate that our method improves the return of the six lifelong DRL methods by at least 25%.
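The abstract does not specify how the relaxed softmax is applied to the target Q-values. Below is a minimal sketch of one plausible reading, in which the hard max in the Bellman target is replaced by a temperature-controlled, softmax-weighted next-state value; the function name, the temperature parameter tau, and the NumPy interface are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def relaxed_softmax_target(q_next, rewards, dones, gamma=0.99, tau=1.0):
    """Bellman targets using a softmax-weighted next-state value
    instead of the hard max over actions (illustrative sketch).

    q_next  : (batch, n_actions) target-network Q-values for next states
    rewards : (batch,) immediate rewards
    dones   : (batch,) 1.0 if the episode terminated, else 0.0
    tau     : temperature; larger tau relaxes the target toward a uniform
              average over actions, tau -> 0 recovers the hard max
    """
    # Temperature-scaled softmax over actions (subtract max for numerical stability)
    z = (q_next - q_next.max(axis=1, keepdims=True)) / tau
    w = np.exp(z)
    w /= w.sum(axis=1, keepdims=True)
    # Softmax-weighted expectation of next-state action values
    v_next = (w * q_next).sum(axis=1)
    return rewards + gamma * (1.0 - dones) * v_next
```

Compared with the hard max, such a softened target is commonly used to reduce the overestimation of Q-values, which is consistent with the abstract's stated goal of improving the accuracy of the evaluated Q-values.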