DORA: A Scalable Asynchronous Reinforcement Learning System for Language Model Training

2026-04-30 · Source: cs.LG updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Cloud Computing & IT Infrastructure · Depth: Expert, extended

Summary

DORA (Dynamic ORchestration for Asynchronous Rollout) is a scalable asynchronous reinforcement learning (RL) system designed to accelerate large language model (LLM) post-training, particularly in long-context scenarios. It targets the rollout phase, which accounts for 50-80% of total step time because generation lengths are skewed: a few long-tailed trajectories hold up the entire synchronous pipeline. DORA introduces "multi-version streaming training," which keeps multiple policy versions live concurrently on rollout instances to eliminate generation bubbles while preserving three algorithmic constraints: intra-trajectory policy consistency, data integrity, and bounded staleness. A centralized load-balancing orchestrator dynamically re-partitions resources and migrates requests, leveraging KV-Cache equivalence so that migrated requests need no re-prefill. On open-source benchmarks, DORA achieves up to 2.12x end-to-end throughput and up to 8.2x rollout-stage speedup over synchronous training while maintaining convergence parity. In industrial deployments on thousands of accelerators, it speeds up the rollout stage by up to 6.2x and has been used to produce competitive open-source models such as LongCat-Flash-Thinking.
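To make the "generation bubble" concrete, here is a minimal sketch of how skewed lengths waste capacity in a synchronous rollout. It assumes one trajectory per decode slot and a uniform per-token cost; the function name and the batch of token counts are hypothetical and not taken from the paper.

```python
from typing import List


def sync_rollout_utilization(lengths: List[int]) -> float:
    """Fraction of decode slots doing useful work when every slot
    must wait for the longest trajectory in the batch to finish."""
    if not lengths:
        raise ValueError("need at least one trajectory")
    longest = max(lengths)
    return sum(lengths) / (longest * len(lengths))


# Hypothetical batch: seven short responses and one long-tailed one (token counts).
lengths = [1_000] * 7 + [16_000]
utilization = sync_rollout_utilization(lengths)
bubble = 1.0 - utilization
print(f"utilization={utilization:.3f}, bubble={bubble:.3f}")
```

Under these assumptions a single long trajectory leaves most slots idle for most of the step, which is the idle time asynchronous rollout tries to reclaim.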

Key takeaway

For AI and research scientists optimizing LLM post-training, DORA targets the rollout bottleneck directly. Its multi-version streaming and dynamic orchestration deliver up to 8.2x faster rollout in the reported experiments while maintaining convergence parity with synchronous training. Its KV-Cache reuse is worth considering for long-context workloads and MoE architectures, since it lets migrated requests skip re-prefill and so improves overall training throughput.

Key insights

Asynchronous RL for LLMs can be both efficient and convergent when each trajectory is generated under a consistent policy and rollout resources are orchestrated dynamically.

Principles

Intra-trajectory policy consistency is crucial for RL convergence.
Data integrity prevents loss of critical long-tailed trajectories.
Bounded staleness limits policy lag for effective importance sampling.
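The three principles above can be read as checks on a batch of finished trajectories. The sketch below is one illustrative interpretation, not DORA's implementation: the `Trajectory` fields, the `check_constraints` helper, and the `max_staleness` threshold are all hypothetical, and the "missing" check assumes all generation for the submitted prompts has already drained.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class Trajectory:
    prompt_id: int
    token_versions: List[int]  # hypothetical: policy version that sampled each token


def check_constraints(
    batch: List[Trajectory],
    submitted_prompt_ids: Set[int],
    trainer_version: int,
    max_staleness: int,
) -> Dict[str, List[int]]:
    """Report prompt ids that violate each of the three constraints."""
    report: Dict[str, List[int]] = {"inconsistent": [], "too_stale": [], "missing": []}
    seen: Set[int] = set()
    for traj in batch:
        seen.add(traj.prompt_id)
        versions = set(traj.token_versions)
        # Intra-trajectory policy consistency: one policy version per trajectory.
        if len(versions) > 1:
            report["inconsistent"].append(traj.prompt_id)
        # Bounded staleness: the oldest sampling policy may lag the trainer
        # by at most max_staleness versions.
        if versions and trainer_version - min(versions) > max_staleness:
            report["too_stale"].append(traj.prompt_id)
    # Data integrity: every submitted prompt must come back, long-tailed or not.
    report["missing"] = sorted(submitted_prompt_ids - seen)
    return report


batch = [
    Trajectory(0, [7, 7, 7]),  # consistent and fresh
    Trajectory(1, [6, 7, 7]),  # weights changed mid-trajectory
    Trajectory(2, [3, 3]),     # consistent but four versions behind
]
report = check_constraints(batch, {0, 1, 2, 3}, trainer_version=7, max_staleness=2)
print(report)
```

A system that honors the constraints by construction would never produce a non-empty report; a check like this is mainly useful as an assertion during development.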

Method

DORA uses multi-version streaming training with a load-balancing orchestrator and KV-Cache reuse to overlap generation and training, dynamically manage resources, and enable zero-re-prefill migration.
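One way to picture multi-version streaming is a pool in which each rollout instance is pinned to a single policy version while it has work in flight. The sketch below is a simplified assumption about how such a pool could behave, not the paper's design: the class and method names are hypothetical, weight loading is reduced to an integer assignment, and staleness limits are left out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RolloutInstance:
    name: str
    version: int
    in_flight: int = 0  # trajectories currently being decoded


class MultiVersionPool:
    """Toy pool: busy instances keep their version, idle ones upgrade."""

    def __init__(self, instances: List[RolloutInstance]) -> None:
        self.instances = instances
        self.latest = max(inst.version for inst in instances)

    def publish(self, version: int) -> None:
        """Trainer publishes new weights; only idle instances load them now,
        so no in-flight trajectory ever sees two policy versions."""
        self.latest = version
        for inst in self.instances:
            if inst.in_flight == 0:
                inst.version = version

    def route(self) -> Optional[RolloutInstance]:
        """Send a new prompt to the least-loaded instance on the latest version."""
        fresh = [inst for inst in self.instances if inst.version == self.latest]
        if not fresh:
            return None
        target = min(fresh, key=lambda inst: inst.in_flight)
        target.in_flight += 1
        return target

    def finish(self, inst: RolloutInstance) -> None:
        """A trajectory completed; upgrade the instance once it has drained."""
        if inst.in_flight <= 0:
            raise ValueError(f"{inst.name} has nothing in flight")
        inst.in_flight -= 1
        if inst.in_flight == 0:
            inst.version = self.latest


a = RolloutInstance("a", version=1, in_flight=2)
b = RolloutInstance("b", version=1)
pool = MultiVersionPool([a, b])

pool.publish(2)
versions_after_publish = (a.version, b.version)  # a is busy, b is idle
target = pool.route()                            # new work goes to the fresh instance
pool.finish(a)
pool.finish(a)                                   # a drains and picks up version 2
print(versions_after_publish, target.name, a.version)
```

The point of the sketch is the overlap: generation on the old version continues while new prompts already run on the new one, so neither the trainer nor the rollout fleet waits for the slowest trajectory.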

In practice

Use multi-version streaming to eliminate rollout bottlenecks.
Implement dynamic resource orchestration to prevent fragmentation.
Exploit KV-Cache equivalence for zero-re-prefill migration.
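The orchestration and migration steps can be sketched as a consolidation pass: requests running on an older policy version are packed onto as few instances as possible, which frees whole instances for the newest version. The sketch assumes that KV-Cache equivalence means a cache computed on one instance is valid on any other instance holding the same policy version, so a move transfers the cache instead of recomputing it. All names, the `capacity` limit, and the token counts are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Request:
    request_id: int
    cached_tokens: int  # tokens whose KV entries already exist on the source


@dataclass
class Instance:
    name: str
    version: int
    capacity: int  # hypothetical limit on concurrent requests
    requests: List[Request] = field(default_factory=list)


def consolidate(instances: List[Instance], version: int) -> List[Tuple[int, str, str, int]]:
    """Pack requests of one policy version onto the fewest instances.

    Returns (request_id, source, destination, cached_tokens) per move.
    Moves stay within a version, so under the equivalence assumption the
    KV cache can be shipped as-is and no tokens are re-prefilled.
    """
    group = sorted(
        (inst for inst in instances if inst.version == version),
        key=lambda inst: len(inst.requests),
        reverse=True,
    )
    moves: List[Tuple[int, str, str, int]] = []
    lo, hi = 0, len(group) - 1  # lo: fullest receiver, hi: emptiest donor
    while lo < hi:
        dst, src = group[lo], group[hi]
        if len(dst.requests) >= dst.capacity:
            lo += 1
        elif not src.requests:
            hi -= 1
        else:
            req = src.requests.pop()
            dst.requests.append(req)
            moves.append((req.request_id, src.name, dst.name, req.cached_tokens))
    return moves


fleet = [
    Instance("A", version=1, capacity=4, requests=[Request(1, 8_000), Request(2, 12_000)]),
    Instance("B", version=1, capacity=4, requests=[Request(3, 30_000)]),
    Instance("C", version=1, capacity=4, requests=[Request(4, 20_000)]),
    Instance("D", version=2, capacity=4, requests=[Request(5, 500)]),
]
moves = consolidate(fleet, version=1)
freed = [inst.name for inst in fleet if inst.version == 1 and not inst.requests]
prefill_tokens_avoided = sum(move[3] for move in moves)
print(moves, freed, prefill_tokens_avoided)
```

In this toy fleet the two long-context requests move onto instance A, instances B and C become free to load newer weights, and the tokens that would otherwise be re-prefilled are counted in `prefill_tokens_avoided`. A real orchestrator would also have to weigh cache transfer cost and memory headroom, which this sketch ignores.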

Topics

DORA System
Asynchronous Reinforcement Learning
LLM Post-training
Multi-version Streaming Training
Dynamic Resource Orchestration

Code references

THUDM/slime

Best for: AI Scientist, Research Scientist, AI Engineer, MLOps Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.LG updates on arXiv.org.