Replay What Matters: Off-Policy Replay for Efficient LLM Reinforcement Unlearning

2026-06-13 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

ReRULE adds off-policy replay to LLM reinforcement unlearning to address inefficiencies in existing RL-based methods such as RULE. Current approaches repeatedly sample easy cases and discard low-reward hard-case rollouts, wasting computation. ReRULE instead stores low-reward hard-case rollout groups in a replay buffer during early GRPO training and reuses them in later stages through importance-sampled off-policy updates, which focuses computation on challenging boundary cases. Theoretically, ReRULE offers a tighter hard-case convergence bound. Empirically, it raises Retain Quality on MUSE-Books from 46.3 to 56.2 while adding only 5-11% to training time across benchmarks. The gains are more pronounced in complex settings such as MUSE-Books than in the simpler TOFU setting.

Key takeaway

If you are a Machine Learning Engineer building LLM unlearning solutions and are struggling with computational inefficiency or suboptimal retain quality, consider implementing off-policy replay. ReRULE shows that storing and reusing hard-case rollouts improves retain quality (on MUSE-Books, from 46.3 to 56.2) while adding only 5-11% to training time. The approach redirects computation toward critical boundary cases, making unlearning more effective and resource-efficient, especially on complex datasets.

Key insights

Off-policy replay efficiently targets hard cases in LLM reinforcement unlearning, improving performance with minimal training overhead.

Principles

On-policy RL unlearning wastes computation on easy cases.
Reusing low-reward hard-case rollouts improves convergence.
Replay benefits grow with the disparity between hard and easy cases.
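
One way to see why easy cases can waste on-policy computation is to look at group-relative advantages of the kind GRPO uses. The sketch below is an illustration under assumptions, not the paper's analysis: the function name, the reward values, and the choice of population standard deviation are all hypothetical. It shows that a group whose rollouts all earn the same reward yields zero advantages, while a mostly failing group still yields a usable signal.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each reward within its rollout group (GRPO-style).

    `eps` guards against division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical reward values, chosen only for illustration.
easy_group = [1.0, 1.0, 1.0, 1.0]  # every rollout already succeeds
hard_group = [0.0, 0.0, 1.0, 0.0]  # mostly failing, one success

easy_adv = group_relative_advantages(easy_group)
hard_adv = group_relative_advantages(hard_group)

print("easy:", easy_adv)  # all zeros: no learning signal from this group
print("hard:", hard_adv)  # nonzero: the single success stands out
```

Under this reading, rollouts spent on uniformly easy prompts contribute nothing to the update, which is consistent with the digest's point that hard cases are where the remaining signal lies.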

Method

ReRULE stores low-reward hard-case rollout groups in a replay buffer during early GRPO training, then reuses them in later stages via importance-sampled off-policy updates.
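
The store-then-reuse idea described above can be sketched as a small replay buffer plus an importance-weight helper. This is a minimal sketch under stated assumptions, not ReRULE's implementation: the class and function names, the mean-reward threshold used to decide what counts as a hard case, the FIFO eviction, and the clamping of the ratio are all hypothetical choices that the digest does not specify.

```python
import math
import random
from dataclasses import dataclass, field


@dataclass
class RolloutGroup:
    prompt: str
    responses: list          # sampled completions for this prompt
    rewards: list            # one scalar reward per response
    behavior_logprobs: list  # sequence log-probs under the policy that sampled them


@dataclass
class HardCaseReplayBuffer:
    reward_threshold: float = 0.3  # assumption: "hard" means mean reward below this
    capacity: int = 1024           # assumption: bounded FIFO buffer
    groups: list = field(default_factory=list)

    def maybe_store(self, group):
        """Keep the group only if its mean reward marks it as a hard case."""
        mean_reward = sum(group.rewards) / len(group.rewards)
        if mean_reward >= self.reward_threshold:
            return False
        self.groups.append(group)
        if len(self.groups) > self.capacity:
            self.groups.pop(0)  # evict the oldest stored group
        return True

    def sample(self, k, rng=None):
        """Draw up to k stored groups for a later off-policy update."""
        rng = rng or random
        return rng.sample(self.groups, min(k, len(self.groups)))


def importance_weights(current_logprobs, behavior_logprobs, clip=0.2):
    """Ratios pi_current / pi_behavior, clamped to [1 - clip, 1 + clip].

    Clamping is a common variance-control choice and an assumption here.
    """
    weights = []
    for new_lp, old_lp in zip(current_logprobs, behavior_logprobs):
        ratio = math.exp(new_lp - old_lp)
        weights.append(min(max(ratio, 1.0 - clip), 1.0 + clip))
    return weights
```

In a training loop, `maybe_store` would be called on each rollout group during the early stage, and later stages would mix `sample(...)` output into the batch, scaling each replayed rollout's loss term by its importance weight to correct for the policy having changed since the rollout was generated.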

In practice

Apply off-policy replay to focus unlearning on boundary cases.
Prioritize replay for complex datasets with varied difficulty.
Use replay to boost retain quality with minor training cost.

Topics

LLM Unlearning
Reinforcement Learning
Off-Policy Replay
GRPO
MUSE-Books
TOFU Dataset

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.