Super Apriel: One Checkpoint, Many Speeds

2026-02-17 · Source: cs.LG updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, extended

Summary

ServiceNow AI has released "Super Apriel," a 15B-parameter supernet in which each decoder layer can switch at runtime among four trained mixers: Full Attention (FA), Sliding Window Attention (SWA), Kimi Delta Attention (KDA), and Gated DeltaNet (GDN). A single checkpoint therefore provides multiple speed presets and supports speculative decoding without a separate draft model. The all-FA preset matches the Apriel 1.6 teacher on benchmarks, while the recommended hybrid presets range from 2.9x decode throughput at 96% quality retention to 10.7x at 77%, and the throughput advantage grows with context length. A cluster-expansion surrogate model is used to search the very large configuration space for the best speed-quality tradeoffs. The supernet is trained by stochastic distillation from a frozen Apriel 1.6 teacher, followed by supervised fine-tuning. The release is open source and includes the weights, training code (Fast-LLM), vLLM serving code, and a placement optimization toolkit.
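To make "four mixer choices per decoder layer" concrete, the sketch below counts the possible placements for a given depth and draws one placement at random, as a stochastic training step might. The uniform sampling scheme and the helper names are assumptions for illustration; the digest does not state the layer count or how placements are sampled.

```python
import random

# The four mixer types named in the summary.
MIXERS = ("FA", "SWA", "KDA", "GDN")


def num_placements(num_layers: int) -> int:
    """Number of distinct per-layer mixer assignments for a given depth."""
    return len(MIXERS) ** num_layers


def sample_placement(num_layers: int, rng: random.Random) -> list[str]:
    """Draw one mixer per layer uniformly at random (assumed sampling scheme)."""
    return [rng.choice(MIXERS) for _ in range(num_layers)]


def all_fa(num_layers: int) -> list[str]:
    """The all-Full-Attention preset expressed as a placement."""
    return ["FA"] * num_layers


if __name__ == "__main__":
    rng = random.Random(0)
    depth = 8  # hypothetical depth, not the real model's layer count
    print(num_placements(depth), sample_placement(depth, rng))
```

Because the count grows as 4 to the power of the layer count, exhaustive evaluation is impractical for any realistic depth, which is why a surrogate is needed to search the space.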

Key takeaway

For AI engineers deploying large language models, Super Apriel turns the speed-quality tradeoff into a runtime setting. A single deployment can switch between high-quality, lower-throughput configurations and aggressive, high-throughput presets to suit a specific task or the current load. This removes the need to deploy or retrain a separate model for each operating point, which simplifies MLOps workflows and can reduce operational costs.

Key insights

Super Apriel offers dynamic, single-checkpoint control over LLM speed-quality tradeoffs via runtime mixer placement.

Principles

Hybrid architectures balance speed and quality.
Placement optimization is critical for efficiency.
Small-scale findings may not transfer to large models.

Method

Super Apriel trains all four mixer types simultaneously via stochastic distillation, then uses a cluster-expansion surrogate with dynamic programming for cost-constrained placement optimization, followed by targeted supervised fine-tuning.
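The cost-constrained placement step can be sketched as a small dynamic program. The version below assumes the surrogate is truncated to per-layer terms plus interactions between adjacent layers, and that each mixer has an integer cost; both are simplifying assumptions, and the function name, score tables, and costs are hypothetical rather than taken from the released toolkit.

```python
MIXERS = ("FA", "SWA", "KDA", "GDN")


def best_placement(unary, pairwise, cost, budget):
    """Maximize a chain-structured surrogate score with total cost <= budget.

    unary[l][m]         : score contribution of mixer m at layer l
    pairwise[l][(a, b)] : interaction between mixer a at layer l and
                          mixer b at layer l + 1 (missing keys count as 0)
    cost[m]             : integer cost of mixer m (hypothetical units)
    Returns (score, placement), or None if no placement fits the budget.
    """
    # DP state: (cost spent so far, mixer at the current layer)
    #        -> (best score, placement reaching that state)
    states = {}
    for m in MIXERS:
        if cost[m] <= budget:
            states[(cost[m], m)] = (unary[0][m], [m])

    for layer in range(1, len(unary)):
        nxt = {}
        for (spent, prev), (score, path) in states.items():
            for m in MIXERS:
                total = spent + cost[m]
                if total > budget:
                    continue  # this extension would exceed the budget
                val = score + unary[layer][m] + pairwise[layer - 1].get((prev, m), 0.0)
                key = (total, m)
                if key not in nxt or val > nxt[key][0]:
                    nxt[key] = (val, path + [m])
        states = nxt

    if not states:
        return None
    return max(states.values(), key=lambda item: item[0])


if __name__ == "__main__":
    # Toy numbers, invented for illustration only.
    unary = [{"FA": 1.0, "SWA": 0.8, "KDA": 0.7, "GDN": 0.6} for _ in range(3)]
    pairwise = [{("KDA", "KDA"): -0.3}, {("GDN", "GDN"): -0.2}]
    cost = {"FA": 4, "SWA": 2, "KDA": 1, "GDN": 1}
    print(best_placement(unary, pairwise, cost, budget=7))
```

Keeping the previous layer's mixer in the state is what lets adjacent-layer interactions be scored exactly; a surrogate with longer-range clusters would need a larger state or a different search.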

In practice

Use Super Apriel for flexible LLM deployment.
Leverage dynamic placement for workload adaptation.
Employ speculative decoding with the shared checkpoint.
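Speculative decoding with a shared checkpoint can be pictured as follows: a fast preset proposes a few tokens and a higher-quality preset verifies them. The sketch below is a minimal greedy variant in which `draft_next` and `target_next` are hypothetical stand-ins for the same checkpoint run under two different placements; it is not the released vLLM integration, and a real server would verify all proposed tokens in one batched forward pass rather than one call per token.

```python
from typing import Callable, List

# A next-token function maps a token-id context to the next token id.
NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft_next: NextToken,
                     target_next: NextToken, k: int) -> List[int]:
    """Propose k tokens with the draft, keep the prefix the target agrees with."""
    # 1. Draft phase: the fast preset proposes k tokens autoregressively.
    proposal, ctx = [], list(prefix)
    for _ in range(k):
        token = draft_next(ctx)
        proposal.append(token)
        ctx.append(token)

    # 2. Verify phase: accept draft tokens until the first disagreement.
    accepted, ctx = [], list(prefix)
    for token in proposal:
        expected = target_next(ctx)
        if token != expected:
            accepted.append(expected)  # replace the rejected token and stop
            return accepted
        accepted.append(token)
        ctx.append(token)

    # All k proposals accepted: the target contributes one extra token.
    accepted.append(target_next(ctx))
    return accepted


def generate(prefix: List[int], draft_next: NextToken,
             target_next: NextToken, k: int, max_new: int) -> List[int]:
    """Generate max_new tokens; output equals greedy decoding with the target."""
    out = list(prefix)
    while len(out) - len(prefix) < max_new:
        out.extend(speculative_step(out, draft_next, target_next, k))
    return out[: len(prefix) + max_new]
```

In this greedy form the output is identical to decoding with the verifier alone, so the draft preset affects only speed: the more often it agrees with the verifier, the more tokens are accepted per verification step.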

Topics

Supernet Architecture
Flexible LLM Deployment
Token Mixers
Placement Optimization
Stochastic Distillation

Best for: MLOps Engineer, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.LG updates on arXiv.org.