Aurora
Summary
Aurora, released on March 31, 2026, is an open-source, RL-based framework designed to address the issue of stale draft models in speculative decoding for large language models. It implements a "serve-to-train flywheel" that continuously updates the speculator from live inference traces without interrupting serving. This approach enables real-time adaptation to shifting traffic domains, achieving an additional 1.25x speedup over well-trained static speculators on models like Qwen3 and Llama3. Aurora also reduces infrastructure costs by eliminating large-scale activation-collection pipelines. Notably, online training from scratch with Aurora can outperform carefully pretrained static baselines: acceptance length reaches 3.08, surpassing both the static baseline (2.63) and the pretrained-then-finetuned baseline (2.99), and throughput stabilizes at 302.3 tokens/s in mixed-traffic scenarios.
Key takeaway
For MLOps engineers managing large language model deployments, Aurora marks a shift from static speculative decoding to a dynamic, self-improving system. Consider integrating this open-source, RL-based framework to mitigate performance degradation from distribution shifts and to reduce reliance on costly offline retraining pipelines. It can yield an additional 1.25x speedup over a well-trained static speculator and help keep your inference systems performant and cost-efficient in production.
Key insights
Aurora's RL-powered serve-to-train flywheel continuously adapts speculative decoding to live traffic, improving performance and reducing costs.
Principles
- Speculative decoding benefits from continuous online adaptation.
- Align training signals directly with real deployment utility.
- Decouple inference and asynchronous training for non-disruptive updates.
Method
Aurora uses an asynchronous RL approach where an Inference Server streams results to a distributed data buffer. A Training Server fetches data, updates a draft model copy, and hot-swaps improved weights back.
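The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Aurora's implementation: `DraftModel`, `inference_server`, `training_server`, and `run_flywheel` are hypothetical names, the "weights" are reduced to a version counter, an in-process `queue.Queue` stands in for the distributed data buffer, and the RL update itself is a placeholder.

```python
import queue
import threading


class DraftModel:
    """Hypothetical stand-in for the speculator; 'weights' are a version counter."""

    def __init__(self):
        self._lock = threading.Lock()
        self._version = 0

    def version(self):
        with self._lock:
            return self._version

    def hot_swap(self, new_version):
        # Replace the serving weights atomically so serving never pauses.
        with self._lock:
            self._version = new_version


def inference_server(model, buffer, num_requests):
    """Serve requests and stream one trace per request into the buffer."""
    for request_id in range(num_requests):
        trace = {"request_id": request_id, "draft_version": model.version()}
        buffer.put(trace)
    buffer.put(None)  # sentinel: no more traffic


def training_server(model, buffer, batch_size):
    """Fetch traces, update a private copy, and hot-swap after each full batch."""
    local_version = model.version()  # private copy of the draft model
    batch = []
    swaps = 0
    while True:
        trace = buffer.get()
        if trace is None:
            break
        batch.append(trace)
        if len(batch) == batch_size:
            local_version += 1  # placeholder for a real RL update step
            model.hot_swap(local_version)
            swaps += 1
            batch.clear()
    return swaps


def run_flywheel(num_requests=12, batch_size=4):
    model = DraftModel()
    buffer = queue.Queue()  # assumption: stands in for the distributed buffer
    result = {}

    def train():
        result["swaps"] = training_server(model, buffer, batch_size)

    trainer = threading.Thread(target=train)
    server = threading.Thread(
        target=inference_server, args=(model, buffer, num_requests)
    )
    trainer.start()
    server.start()
    server.join()
    trainer.join()
    return model.version(), result["swaps"]
```

Because the trainer only touches its private copy until it calls `hot_swap`, the serving thread is never blocked for longer than a single lock acquisition. A trailing partial batch is simply dropped in this sketch; a real system would likely carry it over.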
In practice
- Implement a serve-to-train loop for dynamic model updates.
- Re-frame speculative decoding as an RL problem for direct optimization.
- Utilize Tree Attention for efficient processing of speculative decoding results.
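To make the Tree Attention item concrete, here is a minimal sketch of the attention mask commonly used in tree-based speculative decoding, where each candidate token may attend only to itself and its ancestors in the draft tree. The source names Tree Attention but does not describe its construction, so the function name `tree_attention_mask` and the parent-array encoding are assumptions chosen for illustration.

```python
def tree_attention_mask(parents):
    """Build a boolean mask for a tree of draft tokens.

    parents[i] is the index of node i's parent, or -1 if node i is a root.
    mask[i][j] is True when token i may attend to token j, that is, when
    j is i itself or one of i's ancestors.
    """
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        node = i
        while node != -1:  # walk from the node up to the root
            mask[i][node] = True
            node = parents[node]
    return mask


# Example tree: node 0 is the root, nodes 1 and 2 are its children,
# and node 3 is a child of node 1.
example_mask = tree_attention_mask([-1, 0, 0, 1])
```

In this example, node 3 attends to nodes 0, 1, and 3 but not to its sibling branch (node 2), which lets several candidate continuations be scored in a single forward pass without interfering with each other.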
Topics
- Speculative Decoding
- Reinforcement Learning
- LLM Inference
- Online Learning
- Distribution Shift
- MLOps
Best for: NLP Engineer, AI Architect, MLOps Engineer, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Together AI (Together.ai).