RolloutPipe: Overlapping Pipelined Rollout and Training in Disaggregated On-Policy LLM Reinforcement Learning

2026-06-25 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure · Depth: Expert, quick

Summary

RolloutPipe is a framework for disaggregated reinforcement learning with verifiable rewards (RLVR) in large language model (LLM) post-training. Existing synchronous on-policy GRPO RLVR systems leave trainer GPUs idle during rollout generation, while asynchronous pipelines train on stale data; RolloutPipe targets the first problem without incurring the second. It introduces complete-group pipelining (CGP) and frontier-group dispatch (FGD) to overlap rollout generation with policy training. CGP dispatches each complete, trainable group to the trainer FIFO as soon as it materializes, while FGD prioritizes requests for frontier groups on the rollout node. Training can therefore begin before the entire rollout completes, and on-policy correctness is maintained. Evaluated on Qwen3-1.7B across four benchmarks, RolloutPipe shortens the rollout-to-train-end time by 30.7%-42.3% and lowers the trainer waiting ratio by 37%-76% compared to Slime, a state-of-the-art system.

Key takeaway

For MLOps engineers optimizing RLVR post-training of large language models, RolloutPipe shows one way to improve GPU utilization. In the reported evaluation on Qwen3-1.7B, pipelining rollout and training lowered the trainer waiting ratio by 37%-76% and shortened the rollout-to-train-end time by 30.7%-42.3% relative to Slime. Techniques such as complete-group pipelining and frontier-group dispatch are worth investigating for disaggregated RLVR systems because they raise throughput and resource efficiency without compromising on-policy correctness.

Key insights

RolloutPipe overlaps LLM RLVR rollout and training to reduce GPU idle time while maintaining on-policy correctness.

Principles

Decouple rollout and training for flexible resource use.
Prioritize data dispatch to minimize trainer idle time.
Maintain on-policy correctness during pipelined operations.
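
One way to picture the on-policy principle is a version guard on the trainer side. The summary does not describe how RolloutPipe enforces correctness, so the sketch below is only an assumption: every name in it (RolloutGroup, policy_version, filter_on_policy) is hypothetical, and it simply shows the idea that a group is trainable only if it was generated by the policy version the trainer currently holds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RolloutGroup:
    """Hypothetical container for one GRPO group of sampled responses."""
    group_id: int
    policy_version: int  # version of the weights that generated the samples
    samples: tuple


def is_on_policy(group: RolloutGroup, trainer_version: int) -> bool:
    # A group is on-policy only if it came from the trainer's current weights.
    return group.policy_version == trainer_version


def filter_on_policy(groups, trainer_version: int):
    # Keep on-policy groups in arrival order; stale groups are dropped.
    return [g for g in groups if is_on_policy(g, trainer_version)]
```

In a synchronous pipeline every group would pass this check by construction; the guard only makes the invariant explicit.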

Method

RolloutPipe employs complete-group pipelining (CGP) to dispatch groups to the trainer FIFO as soon as they materialize, and frontier-group dispatch (FGD) to prioritize frontier-group requests so that training batches form sooner.
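
The CGP idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the class name CompleteGroupPipeline, the fixed group_size, and the in-process deque standing in for the trainer FIFO are all hypothetical. It shows only the core behavior described above: samples are buffered per group, and a group is handed to the trainer queue the moment its last sample arrives.

```python
from collections import defaultdict, deque


class CompleteGroupPipeline:
    """Toy model of complete-group pipelining (names are illustrative)."""

    def __init__(self, group_size: int):
        self.group_size = group_size      # samples per GRPO group (assumed fixed)
        self.partial = defaultdict(list)  # group_id -> samples finished so far
        self.trainer_fifo = deque()       # complete groups, in completion order

    def on_sample_finished(self, group_id, sample) -> bool:
        """Record one finished sample; return True if its group just completed."""
        bucket = self.partial[group_id]
        bucket.append(sample)
        if len(bucket) == self.group_size:
            # Group is complete: dispatch it without waiting for the full rollout.
            self.trainer_fifo.append((group_id, self.partial.pop(group_id)))
            return True
        return False

    def next_trainable_group(self):
        """Pop the oldest complete group, or None if none is ready yet."""
        return self.trainer_fifo.popleft() if self.trainer_fifo else None
```

With group_size=2, a trainer polling next_trainable_group() would receive a group as soon as both of its samples finish, even while other groups are still generating.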

In practice

Apply CGP to send complete data groups early.
Implement FGD to accelerate training batch readiness.
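
For the FGD item above, a scheduling sketch may help. The summary does not define "frontier group", so this example assumes it means the incomplete groups with the fewest unfinished requests, that is, the groups closest to becoming trainable. That definition, the function name pick_requests, and its parameters are assumptions for illustration only.

```python
def pick_requests(pending_requests, remaining_per_group, slots):
    """Choose which waiting requests to run next on the rollout node.

    pending_requests: list of (group_id, request_id) tuples awaiting a slot.
    remaining_per_group: dict mapping group_id to its unfinished request count.
    slots: number of requests that can be started now.

    Assumed policy: groups with fewer unfinished requests go first, with ties
    broken by group_id and then request_id so the order is deterministic.
    """
    ranked = sorted(
        pending_requests,
        key=lambda req: (remaining_per_group[req[0]], req[0], req[1]),
    )
    return ranked[:slots]
```

Under this assumed policy, a group with one request left is scheduled ahead of a group with three left, so it completes and reaches the trainer sooner.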

Topics

Large Language Models
Reinforcement Learning
Pipelining
GPU Optimization
On-Policy Learning
Disaggregated Systems

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.