RolloutPipe: Overlapping Pipelined Rollout and Training in Disaggregated On-Policy LLM Reinforcement Learning
Summary
RolloutPipe is a framework for disaggregated reinforcement learning with verifiable rewards (RLVR) in large language model (LLM) post-training. It targets two problems: synchronous on-policy GRPO systems leave trainer GPUs idle while rollouts are generated, and asynchronous pipelines train on stale data. RolloutPipe overlaps rollout generation and policy training through two mechanisms, complete-group pipelining (CGP) and frontier-group dispatch (FGD). CGP dispatches each complete, trainable group to the trainer FIFO as soon as it materializes, while FGD prioritizes requests that belong to frontier groups on the rollout node. Training can therefore begin before the full rollout finishes, without giving up on-policy correctness. Evaluated on Qwen3-1.7B across four benchmarks, RolloutPipe shortens rollout-to-train-end time by 30.7%-42.3% and lowers the trainer waiting ratio by 37%-76% compared with Slime, a state-of-the-art system.
Key takeaway
For MLOps engineers running RLVR post-training of large language models, RolloutPipe shows one way to raise GPU utilization: overlap rollout and training instead of running them back to back. In the reported Qwen3-1.7B experiments, this lowered the trainer waiting ratio by 37%-76% and shortened rollout-to-train-end time by 30.7%-42.3% relative to Slime. Techniques such as complete-group pipelining and frontier-group dispatch are worth investigating for disaggregated RLVR systems because they improve throughput and resource efficiency without compromising on-policy correctness.
Key insights
RolloutPipe overlaps LLM RLVR rollout and training to reduce GPU idle time while maintaining on-policy correctness.
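To see why overlap shortens the rollout-to-train-end time, the toy timing model below may help. It is only an illustration: the arrival times and the per-group training cost are invented numbers, the function names are hypothetical, and the model covers timing alone. It ignores batching, weight synchronization, and everything else a real system such as RolloutPipe or Slime has to handle.

```python
def synchronous_end(arrivals, train_time):
    """Trainer starts only after the last group has been generated."""
    return max(arrivals) + train_time * len(arrivals)


def pipelined_end(arrivals, train_time):
    """Trainer consumes each group as soon as it is complete (FIFO order)."""
    clock = 0.0
    for arrival in sorted(arrivals):
        # Wait for the group if it is not ready yet, then train on it.
        clock = max(clock, arrival) + train_time
    return clock


# Invented completion times (arbitrary units) for four groups.
arrivals = [2.0, 4.0, 5.0, 9.0]
sync_end = synchronous_end(arrivals, train_time=2.0)
pipe_end = pipelined_end(arrivals, train_time=2.0)
print(sync_end, pipe_end)  # prints "17.0 11.0"
```

In this toy case the pipelined trainer finishes at 11.0 instead of 17.0 because most of the training work is hidden behind rollout generation. The numbers are not related to the paper's measurements.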
Principles
- Decouple rollout and training for flexible resource use.
- Prioritize data dispatch to minimize trainer idle time.
- Maintain on-policy correctness during pipelined operations.
Method
RolloutPipe employs Complete-Group Pipelining (CGP) to dispatch materialized groups to the trainer FIFO and Frontier-Group Dispatch (FGD) to prioritize frontier group requests for faster batch formation.
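The CGP idea can be sketched as a small collector that buffers finished samples per group and pushes a group onto a FIFO the moment it is complete. This is a minimal sketch under stated assumptions, not RolloutPipe's implementation: the class, field, and method names are hypothetical, and a "group" is assumed to be the fixed-size set of responses sampled for one prompt, as is usual in GRPO.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass
class Sample:
    group_id: int   # which prompt/group this response belongs to
    text: str       # generated response
    reward: float   # verifiable reward for the response


class CompleteGroupCollector:
    """Hypothetical collector: emits a group only once all its samples exist."""

    def __init__(self, group_size):
        self.group_size = group_size
        self.partial = defaultdict(list)  # group_id -> samples seen so far
        self.trainer_fifo = deque()       # complete groups, in completion order

    def on_sample_finished(self, sample):
        """Record a finished sample; return True if its group became complete."""
        bucket = self.partial[sample.group_id]
        bucket.append(sample)
        if len(bucket) == self.group_size:
            # The group is trainable now, so hand it over without waiting
            # for the rest of the rollout.
            self.trainer_fifo.append(self.partial.pop(sample.group_id))
            return True
        return False


collector = CompleteGroupCollector(group_size=2)
for s in [Sample(0, "a", 1.0), Sample(1, "b", 0.0),
          Sample(1, "c", 1.0), Sample(0, "d", 0.0)]:
    collector.on_sample_finished(s)
ready_order = [group[0].group_id for group in collector.trainer_fifo]
```

Here group 1 completes before group 0, so it reaches the FIFO first; partial groups never reach the trainer.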
In practice
- Apply CGP to send complete data groups early.
- Implement FGD to accelerate training batch readiness.
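One way to picture FGD is as a reordering of the pending request queue on the rollout node. The summary does not define "frontier group" precisely, so the sketch below assumes it means the earliest groups that are still incomplete, that is, the ones the trainer FIFO will need next. That definition, the `width` parameter, and both function names are assumptions made for illustration.

```python
def frontier_groups(remaining, width):
    """Assumed definition: the `width` lowest-numbered incomplete groups.

    remaining maps group_id -> number of samples not yet finished.
    """
    incomplete = sorted(g for g, n in remaining.items() if n > 0)
    return set(incomplete[:width])


def order_requests(queue, remaining, width):
    """Move requests of frontier groups to the front of the queue.

    queue is a list of (group_id, sample_index) tuples. sorted() is stable,
    so the original order is kept within each of the two classes.
    """
    frontier = frontier_groups(remaining, width)
    return sorted(queue, key=lambda req: req[0] not in frontier)


queue = [(3, 0), (1, 1), (2, 0), (1, 0)]
remaining = {0: 0, 1: 2, 2: 1, 3: 4}  # group 0 is already complete
ordered = order_requests(queue, remaining, width=2)
print(ordered)  # prints "[(1, 1), (2, 0), (1, 0), (3, 0)]"
```

With `width=2`, groups 1 and 2 form the frontier, so their requests are served before the request for group 3. Finishing frontier groups sooner is what lets the next training batch form earlier.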
Topics
- Large Language Models
- Reinforcement Learning
- Pipelining
- GPU Optimization
- On-Policy Learning
- Disaggregated Systems
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.