DORA: A Scalable Asynchronous Reinforcement Learning System for Language Model Training
Summary
DORA (Dynamic ORchestration for Asynchronous Rollout) is a scalable asynchronous reinforcement learning (RL) system designed to accelerate large language model (LLM) post-training, particularly in long-context scenarios. It targets the rollout phase, which accounts for 50-80% of total step time because generation lengths are skewed: a few long-tailed trajectories stall the whole synchronous pipeline. DORA introduces "multi-version streaming training," which keeps multiple policy versions running concurrently on rollout instances. This eliminates generation bubbles while preserving three algorithmic constraints: intra-trajectory policy consistency, data integrity, and bounded staleness. A centralized load-balancing orchestrator dynamically re-partitions resources and migrates requests, using KV-Cache equivalence to migrate without re-prefill. On open-source benchmarks, DORA achieves up to 2.12x end-to-end throughput and up to 8.2x rollout-stage speedup over synchronous training while matching its convergence. In industrial deployments with thousands of accelerators, it accelerates the rollout stage by up to 6.2x and has been used to produce competitive open-source models such as LongCat-Flash-Thinking.
Key takeaway
For AI Scientists and Research Scientists optimizing LLM post-training, DORA offers a practical answer to the rollout bottleneck. Its multi-version streaming and dynamic orchestration principles deliver rollout speedups of up to 8.2x without sacrificing convergence. Consider adopting DORA's KV-Cache reuse for long contexts and MoE architectures, where it avoids re-prefill overhead during request migration and improves overall training throughput.
Key insights
Asynchronous RL for LLMs can be highly efficient without hurting convergence, provided the system maintains policy consistency and orchestrates resources dynamically.
Principles
- Intra-trajectory policy consistency is crucial for RL convergence.
- Data integrity prevents loss of critical long-tailed trajectories.
- Bounded staleness limits policy lag for effective importance sampling.
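The principles above can be made concrete with a small sketch. It is illustrative only and not DORA's implementation: the `Trajectory` class, the function names, the `max_staleness` parameter, and the PPO-style clipping of the ratio are all assumptions introduced here. Each trajectory carries a single policy version (intra-trajectory consistency), its lag behind the trainer is checked against a bound (bounded staleness), and per-token importance ratios are computed from log-probabilities.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    # Hypothetical container: one policy version covers every token.
    policy_version: int
    old_logprobs: list = field(default_factory=list)


def within_staleness_bound(traj, trainer_version, max_staleness):
    """True if the trajectory lags the trainer by at most max_staleness versions."""
    lag = trainer_version - traj.policy_version
    return 0 <= lag <= max_staleness


def clipped_importance_ratios(new_logprobs, old_logprobs, clip=0.2):
    """Per-token ratio pi_new / pi_old, clamped to [1 - clip, 1 + clip].

    The clamp is a PPO-style assumption, not something stated in the source.
    """
    ratios = []
    for new_lp, old_lp in zip(new_logprobs, old_logprobs):
        ratio = math.exp(new_lp - old_lp)
        ratios.append(min(max(ratio, 1.0 - clip), 1.0 + clip))
    return ratios
```

One way to respect data integrity under this scheme is to use the staleness check to gate when new rollouts may start, rather than to discard long trajectories that have already been generated.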
Method
DORA uses multi-version streaming training with a load-balancing orchestrator and KV-Cache reuse to overlap generation and training, dynamically manage resources, and enable zero-re-prefill migration.
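As a rough illustration of how multiple policy versions might coexist on rollout instances, the toy model below lets idle instances pick up a newly published version while busy instances finish their in-flight trajectories on the version they started with. The `RolloutInstance` and `VersionTracker` names, and the rule "upgrade only when idle", are assumptions made for this sketch and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class RolloutInstance:
    name: str
    policy_version: int
    in_flight: int = 0  # trajectories still being generated


class VersionTracker:
    """Hypothetical bookkeeping for which instance serves which policy version."""

    def __init__(self, instances):
        self.instances = list(instances)
        self.latest_version = max(i.policy_version for i in self.instances)

    def publish(self, new_version):
        """Record a new trainer version; only idle instances switch to it.

        Busy instances keep their version, so no trajectory mixes two policies.
        """
        self.latest_version = new_version
        for inst in self.instances:
            if inst.in_flight == 0:
                inst.policy_version = new_version

    def versions_in_use(self):
        return sorted({i.policy_version for i in self.instances})
```

In this model the trainer never waits for the slowest instance: after `publish`, generation continues on the old version where needed while fresh work starts on the new one.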
In practice
- Use multi-version streaming to eliminate rollout bottlenecks.
- Implement dynamic resource orchestration to prevent fragmentation.
- Exploit KV-Cache equivalence for zero-re-prefill migration.
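The last two points can be sketched together. The code below assumes that "KV-Cache equivalence" means a cache built under one policy version can be reused on another instance serving that same version; that reading, along with the `Request`, `Instance`, `migrate`, and `rebalance` names, is an assumption of this example rather than a description of DORA's orchestrator. Requests move from the busiest to the least busy instance of the same version, and the cache travels with the request so nothing is re-prefilled.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    request_id: int
    policy_version: int
    kv_cache: list  # stand-in for cached key/value blocks of processed tokens


@dataclass
class Instance:
    name: str
    policy_version: int
    requests: list = field(default_factory=list)


def migrate(request, src, dst):
    """Move a request and its KV cache; returns the number of tokens re-prefilled."""
    if src.policy_version != dst.policy_version:
        # Assumption: caches are only interchangeable under the same policy version.
        raise ValueError("KV cache is only reusable under the same policy version")
    src.requests.remove(request)
    dst.requests.append(request)  # the cache object moves with the request
    return 0


def rebalance(instances):
    """Even out request counts within each policy version; returns moves made."""
    moved = 0
    by_version = {}
    for inst in instances:
        by_version.setdefault(inst.policy_version, []).append(inst)
    for group in by_version.values():
        while True:
            busiest = max(group, key=lambda i: len(i.requests))
            idlest = min(group, key=lambda i: len(i.requests))
            if len(busiest.requests) - len(idlest.requests) <= 1:
                break
            migrate(busiest.requests[-1], busiest, idlest)
            moved += 1
    return moved
```

A real orchestrator would balance on remaining generation length or memory pressure rather than raw request counts; the count-based rule here only keeps the example short.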
Topics
- DORA System
- Asynchronous Reinforcement Learning
- LLM Post-training
- Multi-version Streaming Training
- Dynamic Resource Orchestration
Best for: AI Scientist, Research Scientist, AI Engineer, MLOps Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.LG updates on arXiv.org.