Aurora

2026-03-31 · Source: Together AI | The AI Native Cloud - Together.ai · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Advanced, medium

Summary

Aurora, released on March 31, 2026, is an open-source, RL-based framework that addresses stale draft models in speculative decoding for large language models. It implements a "serve-to-train flywheel" that continuously updates the speculator from live inference traces without interrupting serving. This lets the speculator adapt in real time as traffic shifts across domains, yielding an additional 1.25x speedup over well-trained static speculators on models like Qwen3 and Llama3. Aurora also reduces infrastructure costs by eliminating large-scale activation-collection pipelines. Notably, online training from scratch with Aurora can outperform carefully pretrained static baselines: acceptance length reaches 3.08, surpassing both the static baseline (2.63) and the pretrained-then-finetuned baseline (2.99), and throughput stabilizes at 302.3 tokens/s in mixed-traffic scenarios.

Key takeaway

For MLOps engineers managing large language model deployments, Aurora marks a shift from static speculative decoding to a dynamic, self-improving system. Consider integrating this open-source, RL-based framework to mitigate the performance degradation caused by distribution shift and to reduce reliance on costly offline retraining pipelines. The source reports an additional 1.25x speedup over well-trained static speculators, which helps keep production inference performant and cost-efficient.

Key insights

Aurora's RL-powered serve-to-train flywheel continuously adapts speculative decoding to live traffic, improving performance and reducing costs.

Principles

Speculative decoding benefits from continuous online adaptation.
Align training signals directly with real deployment utility.
Decouple inference and asynchronous training for non-disruptive updates.
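
To make the second principle concrete, here is a minimal sketch of one deployment-aligned signal: the number of leading draft tokens the target model accepts. It assumes greedy verification (a draft token is accepted only if it equals the target model's token at that position) and counts accepted draft tokens only. Conventions for "acceptance length" vary, and the function names and toy traces are hypothetical, not taken from Aurora.

```python
from typing import List, Sequence, Tuple


def accepted_prefix_length(draft: Sequence[int], target: Sequence[int]) -> int:
    """Count leading draft tokens that match the target model's tokens."""
    accepted = 0
    for draft_token, target_token in zip(draft, target):
        if draft_token != target_token:
            break  # the first mismatch ends the accepted prefix
        accepted += 1
    return accepted


def mean_accepted_length(traces: List[Tuple[Sequence[int], Sequence[int]]]) -> float:
    """Average accepted prefix length over (draft, target) trace pairs."""
    if not traces:
        return 0.0
    lengths = [accepted_prefix_length(d, t) for d, t in traces]
    return sum(lengths) / len(lengths)


# Toy traces (made-up token ids): (draft proposal, target model output)
traces = [
    ([5, 9, 2, 7], [5, 9, 2, 7]),  # all four draft tokens accepted
    ([5, 9, 2, 7], [5, 9, 4, 1]),  # first two accepted
    ([3, 1], [8, 1]),              # rejected at the first position
]
print(mean_accepted_length(traces))
```

A signal like this can be read straight off serving traces, which is what lets training optimize for the same quantity that determines the speedup in deployment.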

Method

Aurora uses an asynchronous RL approach where an Inference Server streams results to a distributed data buffer. A Training Server fetches data, updates a draft model copy, and hot-swaps improved weights back.
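
The loop described above can be sketched as a small single-process simulation. This is an illustration under stated assumptions, not Aurora's code: the class names, the toy `signal` field, and the scalar `bias` standing in for model weights are all hypothetical, and a real deployment would run the two servers as separate asynchronous processes around a distributed buffer.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class DraftModel:
    version: int = 0
    bias: float = 0.0  # hypothetical stand-in for real draft-model weights


@dataclass
class InferenceServer:
    draft: DraftModel
    buffer: deque

    def serve(self, request_id: int) -> None:
        # Stream a trace of this request into the shared buffer.
        self.buffer.append(
            {"request": request_id, "version": self.draft.version, "signal": 1.0}
        )

    def hot_swap(self, new_draft: DraftModel) -> None:
        # Replace the serving draft model without stopping the server.
        self.draft = new_draft


@dataclass
class TrainingServer:
    buffer: deque
    draft_copy: DraftModel
    batch_size: int = 4
    lr: float = 0.1

    def step(self) -> Optional[DraftModel]:
        # Train only once enough traces have accumulated.
        if len(self.buffer) < self.batch_size:
            return None
        batch = [self.buffer.popleft() for _ in range(self.batch_size)]
        mean_signal = sum(trace["signal"] for trace in batch) / len(batch)
        # Toy "update": nudge the stand-in weight and bump the version.
        self.draft_copy = DraftModel(
            version=self.draft_copy.version + 1,
            bias=self.draft_copy.bias + self.lr * mean_signal,
        )
        return self.draft_copy


buffer: deque = deque()
server = InferenceServer(DraftModel(), buffer)
trainer = TrainingServer(buffer, DraftModel())

for request_id in range(10):
    server.serve(request_id)   # serving never waits on training
    updated = trainer.step()   # the trainer consumes traces when available
    if updated is not None:
        server.hot_swap(updated)

print(server.draft.version, len(buffer))
```

The design point is that serving only ever appends to the buffer and reads its current draft model, so training can lag, batch, or pause without affecting request handling.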

In practice

Implement a serve-to-train loop for dynamic model updates.
Re-frame speculative decoding as an RL problem for direct optimization.
Utilize Tree Attention for efficient processing of speculative decoding results.
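
As a rough illustration of the tree-attention idea in the last item, the sketch below builds the attention mask for a tree of draft tokens, where each token may attend only to itself and its ancestors. This shows the general technique used in tree-based speculative decoding; the source does not describe Aurora's implementation, and the `parents` encoding and function name are assumptions made for this example.

```python
from typing import List


def tree_attention_mask(parents: List[int]) -> List[List[bool]]:
    """Build a mask where mask[i][j] is True if node i may attend to node j.

    parents[i] is the index of node i's parent, or -1 for a root-level node.
    Each node attends to itself and to every ancestor on its path to the root.
    """
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:  # walk from the node up to the root
            mask[i][j] = True
            j = parents[j]
    return mask


# Toy tree: node 0 is the root, nodes 1 and 2 are its children,
# and node 3 is a child of node 1.
parents = [-1, 0, 0, 1]
for row in tree_attention_mask(parents):
    print(["x" if allowed else "." for allowed in row])
```

With a mask like this, several candidate continuations that share a prefix can be scored in one forward pass, because sibling branches never attend to each other.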

Topics

Speculative Decoding
Reinforcement Learning
LLM Inference
Online Learning
Distribution Shift
MLOps

Code references

togethercomputer/aurora

Best for: NLP Engineer, AI Architect, MLOps Engineer, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Together AI | The AI Native Cloud - Together.ai.