Sparrow: Sparse Rollout for Stable and Efficient Long-context RL of Large Language Models

· Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

Sparrow (Sparse Rollout for Stable and Efficient Long-context RL of Large Language Models) targets the high computational cost of Reinforcement Learning with Verifiable Rewards (RLVR), which stems from long Chain-of-Thought (CoT) rollouts. Sparse attention can accelerate these rollouts relative to dense attention, but it creates a stability-efficiency dilemma: aggressive sparsity leads to model collapse. The work investigates the sparse-to-dense actor-policy mismatch and observes that collapse is not uniform across tokens. The authors hypothesize that stable training requires the lower tail of the per-token actor-policy mismatch to remain above a critical threshold. They propose a dynamic sparsity schedule that maintains this tail statistic during generation, which enables stable training across Qwen3 thinking-family models. The reported rollout speedups are 2.2x for Qwen3-1.7B, 2.4x for Qwen3-4B, and 2.0x for Qwen3-8B, and the thresholds generalize to Qwen3-14B and to coding RL. In addition, DistillSparse, a LoRA-based distillation technique, increases the speedup further by allowing more aggressive sparsity.

Key takeaway

For Machine Learning Engineers optimizing long-context Reinforcement Learning with Verifiable Rewards, Sparrow addresses the stability-efficiency tradeoff of sparse rollouts. Consider a dynamic sparsity schedule driven by the per-token actor-policy mismatch: on Qwen3 models it kept training stable while delivering 2.0x-2.4x rollout speedups. DistillSparse is worth exploring if you need more aggressive sparsity and further speedup in your RL workflows.

Key insights

Dynamic sparsity schedules based on per-token actor-policy mismatch stabilize long-context RL, enabling significant speedups.

Principles

Method

Introduce a dynamic sparsity schedule that keeps the lower tail of per-token actor-policy mismatch constant during generation to ensure stable training.
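
The control loop implied by this method can be illustrated with a minimal sketch. It is not the authors' implementation: the summary does not define the mismatch statistic, the quantile, or the update rule, so everything here is an assumption made for illustration. In particular, the sketch assumes the statistic is the per-token probability ratio of the dense policy to the sparse actor on the sampled token, that the "lower tail" is a low quantile of those ratios, and that sparsity is controlled by a hypothetical `keep_fraction` (the share of context each query attends to). All function names and default values are hypothetical.

```python
import math


def token_ratios(dense_logps, sparse_logps):
    """Per-token ratio pi_dense / pi_sparse for the sampled tokens.

    Assumption: this ratio stands in for the "actor-policy mismatch"
    statistic; the source does not give its exact definition.
    """
    return [math.exp(d - s) for d, s in zip(dense_logps, sparse_logps)]


def lower_tail(values, q=0.01):
    """Nearest-rank q-quantile of a non-empty list of floats."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    idx = max(0, math.ceil(q * len(ordered)) - 1)
    return ordered[idx]


def update_keep_fraction(keep_fraction, tail, target,
                         step=0.05, lo=0.05, hi=1.0):
    """Nudge the attention budget so the tail statistic stays near target.

    Hypothetical rule: if the tail falls below the target, attend to more
    tokens (less sparsity); otherwise sparsify further. The result is
    clamped to [lo, hi].
    """
    if tail < target:
        keep_fraction += step
    else:
        keep_fraction -= step
    return min(hi, max(lo, keep_fraction))


if __name__ == "__main__":
    # Toy log-probabilities for one generated chunk (made-up numbers).
    dense = [-0.2, -1.0, -0.5, -2.0]
    sparse = [-0.2, -0.6, -0.5, -0.9]
    tail = lower_tail(token_ratios(dense, sparse), q=0.25)
    print(update_keep_fraction(0.5, tail, target=0.6))
```

In a real rollout the update would run periodically during generation, for example once per chunk of decoded tokens, so that the budget tracks the statistic as the context grows. The step size, the quantile, and the target threshold would all need tuning per model.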

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.