DORA: A Scalable Asynchronous Reinforcement Learning System for Language Model Training

Source: cs.LG updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Software Development & Engineering, Cloud Computing & IT Infrastructure) · Depth: Expert, extended

Summary

DORA (Dynamic ORchestration for Asynchronous Rollout) is a scalable asynchronous reinforcement learning (RL) system designed to accelerate large language model (LLM) post-training, particularly in long-context scenarios. It targets the rollout phase, which accounts for 50-80% of total step time: generation lengths are skewed, so a few long-tailed trajectories block a synchronous pipeline. DORA introduces "multi-version streaming training," which maintains multiple policy versions concurrently on rollout instances. This eliminates generation bubbles while preserving three algorithmic constraints: intra-trajectory policy consistency, data integrity, and bounded staleness. A centralized load-balancing orchestrator dynamically re-partitions resources and migrates requests, relying on KV-Cache equivalence so that a migrated request needs no re-prefill. On open-source benchmarks, DORA achieves up to 2.12x end-to-end throughput and up to 8.2x rollout-stage speedup over synchronous training while maintaining convergence parity. In industrial deployments with thousands of accelerators, DORA accelerates the rollout stage by up to 6.2x and has produced competitive open-source models such as LongCat-Flash-Thinking.
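
To make the "generation bubble" concrete, the sketch below estimates how much worker-time is idle in one synchronous rollout step when a single trajectory is much longer than the rest. It is a toy model, not taken from the paper: it assumes one trajectory per worker and decode time proportional to generated tokens, and the function name and sample lengths are hypothetical.

```python
def synchronous_idle_fraction(decode_lengths):
    """Idle share of worker-time in one synchronous rollout step.

    Toy model (an assumption, not the paper's measurement): one trajectory
    per worker, decode time proportional to generated tokens, and every
    worker waits for the longest trajectory before training can start.
    """
    if not decode_lengths or max(decode_lengths) <= 0:
        raise ValueError("need at least one trajectory with positive length")
    step_time = max(decode_lengths)   # the step ends when the slowest worker ends
    busy = sum(decode_lengths)        # total useful decode time across workers
    return 1.0 - busy / (len(decode_lengths) * step_time)


# Hypothetical token counts: three short trajectories and one long-tailed one.
lengths = [1000, 1200, 900, 8000]
idle = synchronous_idle_fraction(lengths)
print(f"idle fraction: {idle:.3f}")
```

With these made-up lengths, roughly 65% of worker-time is spent waiting on the single long trajectory, which is the kind of waste an asynchronous design aims to reclaim.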

Key takeaway

For AI scientists and research scientists optimizing LLM post-training, DORA addresses the rollout bottleneck directly. Its multi-version streaming and dynamic orchestration principles yield up to 8.2x faster rollout without sacrificing algorithmic convergence. Consider adopting DORA's KV-Cache reuse for long contexts and MoE architectures, where it reduces re-prefill overhead and improves overall training throughput.

Key insights

Asynchronous RL for LLMs can be both efficient and convergent when it preserves policy consistency within each trajectory and orchestrates resources dynamically.
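
One way to picture policy consistency and bounded staleness together is sketched below. It is a minimal illustration under stated assumptions, not DORA's implementation: `Trajectory`, `RolloutPool`, `is_trainable`, and `max_staleness` are hypothetical names, and policy versions are modeled as plain integers. Each trajectory is pinned to the version that was newest when it started, so it is never generated by mixed policies, and a finished trajectory is accepted for training only if that version is not too far behind the trainer.

```python
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    prompt_id: int
    policy_version: int          # fixed at start; never changes mid-trajectory
    tokens: list = field(default_factory=list)
    done: bool = False


class RolloutPool:
    """Hypothetical rollout pool that keeps several policy versions alive."""

    def __init__(self):
        self.latest_version = 0
        self.in_flight = []

    def load_version(self, version):
        # New weights arrive; trajectories already running keep their version.
        self.latest_version = max(self.latest_version, version)

    def start(self, prompt_id):
        traj = Trajectory(prompt_id, self.latest_version)
        self.in_flight.append(traj)
        return traj

    def resident_versions(self):
        # Versions that must stay loaded: the newest plus any still in use.
        live = {t.policy_version for t in self.in_flight if not t.done}
        return sorted(live | {self.latest_version})


def is_trainable(traj, trainer_version, max_staleness):
    """Accept a finished trajectory only within the staleness bound."""
    lag = trainer_version - traj.policy_version
    return traj.done and 0 <= lag <= max_staleness


pool = RolloutPool()
long_traj = pool.start(prompt_id=1)   # starts under version 0
pool.load_version(1)                  # trainer publishes version 1
short_traj = pool.start(prompt_id=2)  # starts under version 1
print(pool.resident_versions())
```

In this sketch the long trajectory keeps decoding under version 0 while new prompts start under version 1, so no worker has to wait for the slowest sample before picking up fresh weights.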

Method

DORA uses multi-version streaming training with a load-balancing orchestrator and KV-Cache reuse to overlap generation and training, dynamically manage resources, and enable zero-re-prefill migration.
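
The load-balancing idea can be sketched as a greedy rebalancer. This is an illustration only, not the paper's orchestrator: all names are hypothetical, load is approximated by cached token count, and the sketch assumes that KV-Cache equivalence means a cache built under one policy version can be handed to another instance serving that same version, so a migrated request carries its cache and needs no re-prefill.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    req_id: int
    policy_version: int
    cached_tokens: int           # tokens already held in the KV cache


@dataclass
class Instance:
    name: str
    policy_version: int
    requests: list = field(default_factory=list)

    def load(self):
        return sum(r.cached_tokens for r in self.requests)


def rebalance(instances):
    """Greedily migrate requests between instances of the same policy version.

    Assumption (labeled, not from the paper): the KV cache moves with the
    request, so each migration costs zero re-prefill tokens.
    Returns a list of (req_id, source_name, destination_name) migrations.
    """
    by_version = {}
    for inst in instances:
        by_version.setdefault(inst.policy_version, []).append(inst)

    migrations = []
    for group in by_version.values():
        while True:
            src = max(group, key=Instance.load)
            dst = min(group, key=Instance.load)
            if not src.requests:
                break
            req = min(src.requests, key=lambda r: r.cached_tokens)
            # Move only if the destination stays lighter than the source was.
            if dst.load() + req.cached_tokens >= src.load():
                break
            src.requests.remove(req)
            dst.requests.append(req)
            migrations.append((req.req_id, src.name, dst.name))
    return migrations


a = Instance("a", 3, [Request(1, 3, 4000), Request(2, 3, 500), Request(3, 3, 700)])
b = Instance("b", 3)
c = Instance("c", 2)             # different version: never a migration target
moves = rebalance([a, b, c])
print(moves, a.load(), b.load())
```

Restricting migration to instances with the same policy version is the design choice that keeps each trajectory on a single policy while still letting the orchestrator even out load.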

Best for: AI Scientist, Research Scientist, AI Engineer, MLOps Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.LG updates on arXiv.org.