ARROW: Augmented Replay for RObust World models

2026-06-12 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

ARROW (Augmented Replay for RObust World models) is a model-based continual reinforcement learning algorithm that extends DreamerV3 with a memory-efficient, distribution-matching replay buffer. Unlike a standard single fixed-size buffer, ARROW maintains two complementary buffers: a short-term FIFO buffer and a long-term global distribution matching buffer, each with a capacity of 2^18 observations, for a total of 2^19 observations. It was evaluated in two challenging continual RL settings: Atari games (tasks without shared structure) and Procgen CoinRun variants (tasks with shared structure). ARROW showed substantially less catastrophic forgetting in these evaluations. On the Atari tasks, its forgetting score was more than six times lower than DreamerV3's (0.197 vs. 1.217), and on CoinRun forgetting was near zero. ARROW also showed a better stability-plasticity trade-off, with WC-ACC scores of up to 0.615 on Atari, and strong performance recovery in two-cycle training scenarios.
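The figures quoted above can be checked with a few lines of arithmetic. This is only a sanity check on the reported numbers; the variable names are illustrative and do not come from the paper.

```python
# Figures quoted in the summary above.
short_term_capacity = 2 ** 18   # FIFO buffer, in observations
long_term_capacity = 2 ** 18    # distribution-matching buffer, in observations
total_capacity = short_term_capacity + long_term_capacity

dreamer_forgetting = 1.217      # DreamerV3 on the Atari tasks
arrow_forgetting = 0.197        # ARROW on the Atari tasks
reduction_factor = dreamer_forgetting / arrow_forgetting

print(total_capacity == 2 ** 19)    # True
print(total_capacity)               # 524288
print(round(reduction_factor, 2))   # 6.18
```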

Key takeaway

For machine learning engineers deploying reinforcement learning agents that must keep acquiring new skills, ARROW is a practical way to reduce catastrophic forgetting. Consider a dual-buffer replay system that combines a short-term buffer for recent experience with a long-term distribution-matching buffer, so that performance holds up across diverse tasks. According to the reported results, this approach improves stability and knowledge retention, especially when tasks share no structure, which makes lifelong learning systems more reliable.

Key insights

Adding augmented replay to a world model significantly reduces catastrophic forgetting in continual reinforcement learning.

Principles

Distribution-matching replay preserves World Model accuracy.
Two-buffer system balances recency and task diversity.
Model-based RL with replay supports off-policy learning.

Method

ARROW extends DreamerV3 with two replay buffers, a short-term FIFO buffer and a long-term global distribution matching buffer, which are sampled uniformly. It also uses spliced rollouts and fixed-entropy regularization for exploration.
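The two-buffer idea can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the long-term buffer is maintained by reservoir sampling (one common way to keep a stored sample that matches the distribution of everything seen so far), that each draw picks one of the two buffers with equal probability, and that items are opaque records rather than the spliced rollouts used in the paper. The class name DualReplayBuffer and its methods are hypothetical.

```python
import random
from collections import deque


class DualReplayBuffer:
    """Short-term FIFO buffer plus a long-term reservoir-sampled buffer."""

    def __init__(self, short_capacity, long_capacity, seed=0):
        self.short = deque(maxlen=short_capacity)  # evicts the oldest item first
        self.long = []                             # reservoir over all items seen
        self.long_capacity = long_capacity
        self.num_seen = 0
        self.rng = random.Random(seed)

    def add(self, item):
        self.short.append(item)
        self.num_seen += 1
        if len(self.long) < self.long_capacity:
            self.long.append(item)
        else:
            # Reservoir sampling (Algorithm R): the new item is kept with
            # probability long_capacity / num_seen and overwrites a random slot.
            j = self.rng.randrange(self.num_seen)
            if j < self.long_capacity:
                self.long[j] = item

    def sample(self, batch_size):
        # Assumption: each draw picks one non-empty buffer with equal
        # probability, then picks uniformly within that buffer.
        buffers = [b for b in (self.short, self.long) if len(b) > 0]
        if not buffers:
            raise ValueError("cannot sample from an empty buffer")
        batch = []
        for _ in range(batch_size):
            buf = self.rng.choice(buffers)
            batch.append(buf[self.rng.randrange(len(buf))])
        return batch


# Usage: three tasks presented one after another, as in continual RL.
buffer = DualReplayBuffer(short_capacity=4, long_capacity=4)
for task_id in range(3):
    for step in range(100):
        buffer.add((task_id, step))

batch = buffer.sample(8)
```

After the loop, the short-term buffer holds only the last four items of the final task, while the long-term buffer holds a uniform random sample of all 300 items, so earlier tasks can still appear in training batches.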

In practice

Use dual replay buffers for continual learning.
Prioritize distribution matching for long-term retention.
Apply reward scaling for tasks with varied magnitudes.
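The summary does not say how rewards are scaled, so the following is only a sketch under a stated assumption: each task's rewards are divided by the largest absolute reward seen so far for that task, which brings tasks with very different reward magnitudes into a comparable range. The RewardScaler class and the task names are hypothetical.

```python
class RewardScaler:
    """Hypothetical per-task reward scaler (the rule is an assumption)."""

    def __init__(self):
        self.max_abs = {}  # task name -> largest |reward| seen so far

    def update(self, task, reward):
        self.max_abs[task] = max(self.max_abs.get(task, 0.0), abs(reward))

    def scale(self, task, reward):
        scale = self.max_abs.get(task, 0.0)
        # Leave the reward unchanged until a non-zero reward has been seen.
        return reward / scale if scale > 0 else reward


# Usage: two tasks whose raw rewards differ by orders of magnitude.
scaler = RewardScaler()
for r in [1.0, 10.0, 400.0]:
    scaler.update("task_a", r)
for r in [0.1, -0.5]:
    scaler.update("task_b", r)

scaled_a = scaler.scale("task_a", 400.0)   # largest task_a reward maps to 1.0
scaled_b = scaler.scale("task_b", -0.5)    # largest-magnitude task_b reward maps to -1.0
```

A running maximum is simple but sensitive to outliers; a percentile-based or fixed per-task constant would be an equally valid choice.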

Topics

Continual Reinforcement Learning
World Models
Replay Buffers
Catastrophic Forgetting
DreamerV3
Atari Games
Procgen CoinRun

Code references

danijar/dreamerv3

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Robotics Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.