RolloutPipe: Overlapping Pipelined Rollout and Training in Disaggregated On-Policy LLM Reinforcement Learning

Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure · Depth: Expert, quick

Summary

RolloutPipe is a post-training framework for disaggregated reinforcement learning with verifiable rewards (RLVR) on large language models (LLMs). It targets two problems: synchronous on-policy GRPO RLVR systems leave trainer GPUs idle during rollout generation, while asynchronous pipelines train on stale data. RolloutPipe overlaps rollout generation and policy training through two mechanisms, complete-group pipelining (CGP) and frontier-group dispatch (FGD). CGP dispatches each complete trainable group to the trainer FIFO as soon as it materializes; FGD prioritizes requests for frontier groups on the rollout node. Training can therefore begin before the full rollout finishes while remaining on-policy. Evaluated on Qwen3-1.7B across four benchmarks, RolloutPipe shortens rollout-to-train-end time by 30.7%-42.3% and lowers the trainer waiting ratio by 37%-76% compared with Slime, a state-of-the-art system.

Key takeaway

For MLOps engineers optimizing RLVR post-training of LLMs, RolloutPipe shows that pipelining rollout and training can substantially improve GPU utilization. In the reported Qwen3-1.7B evaluation against Slime, it lowered the trainer waiting ratio by 37%-76% and shortened rollout-to-train-end time by 30.7%-42.3%. Techniques such as complete-group pipelining and frontier-group dispatch are worth investigating for disaggregated RLVR systems, since they aim to raise throughput and resource efficiency without compromising on-policy correctness.

Key insights

RolloutPipe overlaps LLM RLVR rollout and training to reduce GPU idle time while maintaining on-policy correctness.
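To make the idle-time argument concrete, the following is a toy timing model, not the paper's measurement method. It assumes each group becomes trainable at a known time and costs a fixed amount of trainer work; the function names, the example numbers, and the idle-fraction definition are all illustrative assumptions.

```python
def synchronous_end(completion_times, train_time_per_group):
    """Trainer starts only after the slowest group has finished rollout."""
    rollout_end = max(completion_times)
    return rollout_end + train_time_per_group * len(completion_times)


def pipelined_end(completion_times, train_time_per_group):
    """Trainer consumes groups in completion order as soon as each is ready."""
    trainer_free_at = 0.0
    for ready_at in sorted(completion_times):
        start = max(trainer_free_at, ready_at)  # wait only if nothing is ready
        trainer_free_at = start + train_time_per_group
    return trainer_free_at


def trainer_idle_fraction(end_time, num_groups, train_time_per_group):
    """Share of the wall-clock window in which the trainer does no work."""
    busy = num_groups * train_time_per_group
    return (end_time - busy) / end_time


# Hypothetical numbers: four groups finish rollout at these times (seconds).
completion_times = [2.0, 3.0, 5.0, 9.0]
train_time = 2.0

sync_end = synchronous_end(completion_times, train_time)
pipe_end = pipelined_end(completion_times, train_time)
sync_idle = trainer_idle_fraction(sync_end, len(completion_times), train_time)
pipe_idle = trainer_idle_fraction(pipe_end, len(completion_times), train_time)
```

In this toy setting the pipelined schedule finishes earlier because only the last group's training remains after the slowest rollout completes. How much real systems gain depends on the rollout length distribution and per-group training cost.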

Method

RolloutPipe employs Complete-Group Pipelining (CGP) to dispatch materialized groups to the trainer FIFO and Frontier-Group Dispatch (FGD) to prioritize frontier group requests for faster batch formation.
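The two mechanisms can be sketched as a small single-process simulation. This is a minimal illustration under stated assumptions, not RolloutPipe's implementation: the class and function names are hypothetical, and because this summary does not define "frontier group", the sketch assumes it means the group with the fewest outstanding requests (ties broken by group id).

```python
from collections import defaultdict, deque


class CompleteGroupBuffer:
    """CGP sketch: release a group to the trainer FIFO once it is complete."""

    def __init__(self, group_size):
        self.group_size = group_size      # samples per prompt group (assumed fixed)
        self.partial = defaultdict(list)  # group_id -> samples received so far
        self.trainer_fifo = deque()       # complete groups, in completion order

    def add_sample(self, group_id, sample):
        bucket = self.partial[group_id]
        bucket.append(sample)
        if len(bucket) == self.group_size:
            # The group has materialized: hand it to the trainer immediately.
            self.trainer_fifo.append((group_id, self.partial.pop(group_id)))


def frontier_order(outstanding):
    """FGD sketch: order groups so those nearest completion are served first.

    `outstanding` maps group_id -> number of requests still to generate.
    ASSUMPTION: "frontier" = fewest outstanding requests; ties by group id.
    """
    active = [g for g, remaining in outstanding.items() if remaining > 0]
    return sorted(active, key=lambda g: (outstanding[g], g))


# Usage: samples arrive interleaved, so group 1 completes before group 0.
buffer = CompleteGroupBuffer(group_size=2)
for group_id, sample in [(0, "a"), (1, "b"), (1, "c"), (0, "d")]:
    buffer.add_sample(group_id, sample)

dispatched = [group_id for group_id, _ in buffer.trainer_fifo]
order = frontier_order({0: 3, 1: 1, 2: 0, 3: 1})
```

The FIFO preserves completion order, so the trainer can start on the first complete group while other groups are still generating. Prioritizing near-complete groups shortens the wait for the next trainable group.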

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.