Replay What Matters: Off-Policy Replay for Efficient LLM Reinforcement Unlearning
Summary
ReRULE adds off-policy replay to LLM reinforcement unlearning to address inefficiencies in existing RL-based methods such as RULE. Current approaches repeatedly sample easy cases and discard low-reward hard-case rollouts, which wastes computation. ReRULE stores low-reward hard-case rollout groups in a replay buffer during early GRPO training and reuses them in later stages through importance-sampled off-policy updates, so that computation is concentrated on challenging boundary cases. Theoretically, ReRULE offers a tighter hard-case convergence bound. Empirically, it improves MUSE-Books Retain Quality from 46.3 to 56.2 while adding only 5-11% to training time across benchmarks. The gains are larger on the more complex MUSE-Books benchmark than on the simpler TOFU setting.
Key takeaway
If you are building LLM unlearning pipelines and are held back by computational inefficiency or weak retain quality, consider off-policy replay. ReRULE shows that storing and reusing hard-case rollouts raises retain quality (MUSE-Books from 46.3 to 56.2) while adding only 5-11% to training time. The approach redirects computation toward critical boundary cases, which makes unlearning more effective and more resource-efficient, especially on complex datasets.
Key insights
Off-policy replay efficiently targets hard cases in LLM reinforcement unlearning, improving performance with minimal training overhead.
Principles
- On-policy RL unlearning wastes computation on easy cases.
- Reusing low-reward hard-case rollouts improves convergence.
- Replay benefits grow with the disparity between hard and easy cases.
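One way to read the first principle is through GRPO's group-relative advantage, which normalizes each rollout's reward against the other rollouts sampled for the same prompt. The sketch below is an illustration under assumptions, not the paper's code: the function name, the use of the population standard deviation, and the `eps` value are all hypothetical choices. It shows that a group whose rollouts all earn the same reward, as can happen on an easy case, yields advantages of zero and therefore little learning signal.

```python
import statistics
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one rollout group (GRPO-style).

    Assumption: population standard deviation plus a small eps for stability;
    real implementations may differ in these details.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# An "easy" group: every rollout gets the same reward, so all advantages are zero.
easy_advantages = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

# A mixed group: rewards differ, so the advantages carry a learning signal.
mixed_advantages = group_relative_advantages([0.0, 1.0])
```

Under this reading, sampling more rollouts for prompts that are already solved spends compute on groups that contribute almost nothing to the update.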
Method
ReRULE stores low-reward hard-case rollout groups in a replay buffer during early GRPO training, then reuses them in later stages via importance-sampled off-policy updates.
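The two stages described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not ReRULE's actual implementation: the class and function names, the mean-reward threshold, the buffer capacity, the first-in-first-out eviction, and the PPO-style ratio clipping are all hypothetical choices that the summary does not specify.

```python
import math
import random
from dataclasses import dataclass, field
from typing import List


@dataclass
class RolloutGroup:
    """One prompt with its sampled completions (hypothetical structure)."""
    prompt: str
    completions: List[str]
    rewards: List[float]
    old_logprobs: List[float]  # sequence log-probs under the policy that sampled them

    @property
    def mean_reward(self) -> float:
        return sum(self.rewards) / len(self.rewards)


@dataclass
class HardCaseBuffer:
    """Stores low-reward groups; threshold and capacity are illustrative."""
    reward_threshold: float = 0.3
    capacity: int = 1024
    groups: List[RolloutGroup] = field(default_factory=list)

    def maybe_store(self, group: RolloutGroup) -> bool:
        """Keep the group only if its mean reward is below the threshold."""
        if group.mean_reward >= self.reward_threshold:
            return False
        if len(self.groups) >= self.capacity:
            self.groups.pop(0)  # assumption: evict the oldest group first
        self.groups.append(group)
        return True

    def sample(self, k: int, rng: random.Random) -> List[RolloutGroup]:
        """Draw up to k stored groups without replacement."""
        return rng.sample(self.groups, min(k, len(self.groups)))


def importance_weights(new_logprobs: List[float], old_logprobs: List[float],
                       clip: float = 0.2) -> List[float]:
    """Ratio pi_new / pi_old per completion, clipped to [1 - clip, 1 + clip].

    Clipping is an assumption borrowed from PPO-style objectives.
    """
    weights = []
    for new_lp, old_lp in zip(new_logprobs, old_logprobs):
        ratio = math.exp(new_lp - old_lp)
        weights.append(min(max(ratio, 1.0 - clip), 1.0 + clip))
    return weights


# Early stage: only the low-reward group is kept.
buffer = HardCaseBuffer(reward_threshold=0.3, capacity=2)
easy = RolloutGroup("easy prompt", ["a", "b"], [1.0, 0.8], [-1.0, -1.0])
hard = RolloutGroup("hard prompt", ["a", "b"], [0.0, 0.2], [-2.0, -3.0])
stored_easy = buffer.maybe_store(easy)
stored_hard = buffer.maybe_store(hard)

# Later stage: replay a stored group and reweight it under the current policy.
replayed = buffer.sample(1, random.Random(0))[0]
current_logprobs = [-2.0, -1.0]  # hypothetical log-probs under the updated policy
weights = importance_weights(current_logprobs, replayed.old_logprobs)
```

The importance weights correct for the gap between the policy that generated the stored rollouts and the policy being updated; in a real training loop they would multiply the per-rollout advantages inside the loss.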
In practice
- Apply off-policy replay to focus unlearning on boundary cases.
- Prioritize replay for complex datasets with varied difficulty.
- Use replay to boost retain quality with minor training cost.
Topics
- LLM Unlearning
- Reinforcement Learning
- Off-Policy Replay
- GRPO
- MUSE-Books
- TOFU Dataset
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.