Value Bonuses using Ensemble Errors for Exploration in Reinforcement Learning

Source: cs.LG updates on arXiv.org · Field: Technology & Digital, Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Advanced, extended

Summary

The paper introduces Value Bonuses using Ensemble Errors (VBE), an algorithm designed to improve exploration and sample efficiency in Reinforcement Learning (RL). VBE addresses limitations of existing optimistic value estimation methods, which can be complex or incompatible with base RL algorithms. It builds on ensemble-based approaches but does not propagate additional reward bonuses, which allows for "first-visit optimism" and "deep exploration." VBE maintains an ensemble of random action-value functions (RQFs) and uses the errors in estimating them to generate value bonuses. The algorithm can be layered on any base RL algorithm, such as Double DQN, with minimal computational overhead. Empirical evaluations on classic control problems (Sparse Mountain Car, Puddle World, River Swim, Deepsea) and Atari environments (Breakout, Pong, Q*bert, Pitfall, Private-Eye, Gravitar) show that VBE outperforms or matches baselines such as Bootstrap DQN (BDQN), RND, and ACB, particularly in scenarios requiring extensive state coverage.

Key takeaway

For research scientists developing deep RL agents, VBE offers a straightforward way to improve exploration and sample efficiency. Layering its ensemble-error-based value bonuses onto an existing algorithm such as Double DQN provides first-visit optimism and deep exploration without significant algorithmic changes or computational overhead. In the reported experiments it matches or outperforms the baselines, which makes it a directed alternative to $\epsilon$-greedy exploration, especially in classic control and Atari environments that require extensive state coverage.

Key insights

VBE uses ensemble errors to create value bonuses for first-visit optimism and deep exploration in RL.
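
Read alongside the method description, this insight can be written compactly. The notation is an assumption made for illustration and is not taken from the paper: $f_i$ is the $i$-th fixed random action-value function, $\hat{q}_i$ is its learned predictor, $k$ is the ensemble size, and $Q$ is the base algorithm's action-value estimate. Acting greedily on the sum $Q + b$ is likewise assumed here as one natural way to use the bonus.

$$
b(s,a) = \max_{i \in \{1,\dots,k\}} \left| \hat{q}_i(s,a) - f_i(s,a) \right|,
\qquad
a_t \in \arg\max_{a} \left[ Q(s_t,a) + b(s_t,a) \right]
$$

Under this reading, a state-action pair that has never been visited still has an untrained predictor, so its error, and therefore its bonus, is nonzero before the first visit.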

Method

VBE maintains an ensemble of random action-value functions (RQFs) and defines a reward for each one so that the reward is consistent with that RQF. A predictor for each RQF is then trained on those rewards with temporal difference learning, and the value bonus is derived from the maximum absolute prediction error across the ensemble.
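
The method can be sketched in tabular form. This is a minimal illustration under stated assumptions, not the authors' implementation: the class and method names (`TabularVBE`, `bonus`, `act`, `update`) are hypothetical, lookup tables stand in for neural networks, the caller supplies the next action, and "rewards consistent with the RQFs" is read as the Bellman-style choice `r_i = f_i(s, a) - gamma * f_i(s', a')`, under which each random function `f_i` is exactly the value function its predictor is trying to learn.

```python
import random


class TabularVBE:
    """Tabular sketch of ensemble-error value bonuses (hypothetical names)."""

    def __init__(self, n_states, n_actions, k=4, gamma=0.9, lr=0.5, seed=0):
        rng = random.Random(seed)
        self.k, self.gamma, self.lr = k, gamma, lr
        # Fixed random targets f_i(s, a): the random action-value functions.
        self.f = [[[rng.gauss(0.0, 1.0) for _ in range(n_actions)]
                   for _ in range(n_states)] for _ in range(k)]
        # Learned predictors start at zero, so every unvisited pair
        # begins with a nonzero error.
        self.q = [[[0.0] * n_actions for _ in range(n_states)]
                  for _ in range(k)]

    def bonus(self, s, a):
        # Maximum absolute prediction error across the ensemble.
        return max(abs(self.q[i][s][a] - self.f[i][s][a])
                   for i in range(self.k))

    def act(self, q_main, s):
        # Greedy action on base value plus bonus (assumed usage).
        return max(range(len(q_main[s])),
                   key=lambda a: q_main[s][a] + self.bonus(s, a))

    def update(self, s, a, s_next, a_next, done=False):
        # One TD step per ensemble member on its own synthetic reward.
        for i in range(self.k):
            f, q = self.f[i], self.q[i]
            f_next = 0.0 if done else f[s_next][a_next]
            q_next = 0.0 if done else q[s_next][a_next]
            reward = f[s][a] - self.gamma * f_next  # assumed reward form
            td_error = reward + self.gamma * q_next - q[s][a]
            q[s][a] += self.lr * td_error


vbe = TabularVBE(n_states=3, n_actions=2)
before = vbe.bonus(0, 0)
for _ in range(300):
    vbe.update(0, 0, 0, 0)  # repeatedly visit a self-looping pair
after = vbe.bonus(0, 0)
print(f"bonus before: {before:.4f}, after: {after:.6f}")
```

In this sketch the bonus for the visited pair shrinks toward zero while the untouched action keeps its initial bonus, so `act` prefers the untried action when the base values are equal. If the transition instead led to a pair that had not yet been learned, the residual error at that next pair would keep the bonus at the current pair above zero, which is the sense in which the sketch propagates uncertainty backward.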

Best for: Research Scientist, AI Researcher, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.LG updates on arXiv.org.