Super Apriel: One Checkpoint, Many Speeds

Source: cs.LG updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, extended

Summary

ServiceNow AI has released "Super Apriel," a 15B-parameter supernet in which every decoder layer carries four trained mixer choices that can be switched at runtime: Full Attention (FA), Sliding Window Attention (SWA), Kimi Delta Attention (KDA), and Gated DeltaNet (GDN). A single checkpoint therefore serves multiple speed presets, and it supports speculative decoding without a separate draft model. The all-FA preset matches the Apriel 1.6 teacher on benchmarks, while the recommended hybrid presets range from 2.9x decode throughput at 96% quality retention to 10.7x at 77%, with the throughput advantage growing at longer context lengths. Because the configuration space is vast (four choices per layer), a cluster-expansion surrogate model is used to identify the best speed-quality tradeoffs. The supernet is trained by stochastic distillation from a frozen Apriel 1.6 teacher, followed by supervised fine-tuning. The release is open source and includes the weights, training code (Fast-LLM), vLLM serving code, and a placement optimization toolkit.

Key takeaway

For AI engineers deploying large language models, Super Apriel turns the speed-quality tradeoff into a runtime choice. A single deployment can switch between high-quality, lower-throughput configurations and aggressive, high-throughput presets, depending on the task or the current load. This removes the need to deploy or retrain a separate model for each operating point, which simplifies MLOps workflows and can reduce operational costs.

Key insights

Super Apriel offers dynamic, single-checkpoint control over LLM speed-quality tradeoffs via runtime mixer placement.
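To make "mixer placement" concrete, here is a minimal sketch of how a placement could be represented: one mixer choice per decoder layer. The class and function names (`Mixer`, `Placement`, `all_fa`, `sample_placement`, `num_configurations`) and the layer count of 8 are hypothetical and chosen for illustration only; they are not taken from the Super Apriel release.

```python
import random
from dataclasses import dataclass
from enum import Enum


class Mixer(Enum):
    """The four mixer choices named in the summary."""
    FA = "full_attention"
    SWA = "sliding_window_attention"
    KDA = "kimi_delta_attention"
    GDN = "gated_deltanet"


@dataclass(frozen=True)
class Placement:
    """One mixer choice per decoder layer (hypothetical representation)."""
    mixers: tuple

    def count(self, mixer):
        # Number of layers that use the given mixer.
        return sum(1 for m in self.mixers if m is mixer)


def all_fa(num_layers):
    # The all-FA preset: every layer uses full attention.
    return Placement(tuple(Mixer.FA for _ in range(num_layers)))


def sample_placement(num_layers, rng):
    # Assumption: an independent, uniform draw per layer. The actual
    # sampling rule used during stochastic distillation is not given here.
    choices = list(Mixer)
    return Placement(tuple(rng.choice(choices) for _ in range(num_layers)))


def num_configurations(num_layers):
    # Four choices per layer gives 4 ** num_layers distinct placements.
    return len(Mixer) ** num_layers


if __name__ == "__main__":
    NUM_LAYERS = 8  # hypothetical; not the real model depth
    print(all_fa(NUM_LAYERS).count(Mixer.FA))
    print(sample_placement(NUM_LAYERS, random.Random(0)))
    print(num_configurations(NUM_LAYERS))
```

Even at this toy depth the number of placements grows as a power of four, which is why exhaustive evaluation of every configuration is impractical for a full-size model.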

Method

Super Apriel trains all four mixer types simultaneously via stochastic distillation. A cluster-expansion surrogate, combined with dynamic programming, then optimizes mixer placement under a cost constraint, and targeted supervised fine-tuning follows.
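The placement step can be illustrated with a small sketch. It assumes the surrogate is a cluster expansion truncated to single-layer terms plus adjacent-layer pair terms, and that each mixer has an integer cost. Under those assumptions a Viterbi-style dynamic program over (last mixer, cost used) finds the best placement within a budget. The truncation, the function name `best_placement`, and all scores and costs below are illustrative assumptions, not the paper's actual surrogate or numbers.

```python
MIXERS = ("FA", "SWA", "KDA", "GDN")


def best_placement(site, pair, cost, budget):
    """Maximize a truncated cluster-expansion score under a cost budget.

    site[l][m]   : single-layer term for mixer m at layer l
    pair[(a, b)] : term for mixer a directly followed by mixer b
    cost[m]      : integer cost of using mixer m in one layer
    Returns (score, placement) or None if no placement fits the budget.
    """
    # best[(m, c)] = (score, placement) for the best prefix that ends
    # in mixer m and has total cost c.
    best = {}
    for m in MIXERS:
        if cost[m] <= budget:
            best[(m, cost[m])] = (site[0][m], (m,))

    for layer in range(1, len(site)):
        nxt = {}
        for (prev, used), (score, path) in best.items():
            for m in MIXERS:
                total = used + cost[m]
                if total > budget:
                    continue  # this extension would exceed the budget
                cand = score + site[layer][m] + pair.get((prev, m), 0.0)
                key = (m, total)
                if key not in nxt or cand > nxt[key][0]:
                    nxt[key] = (cand, path + (m,))
        best = nxt

    if not best:
        return None
    return max(best.values(), key=lambda item: item[0])


if __name__ == "__main__":
    # Toy numbers: FA scores highest but costs the most per layer.
    layer_scores = {"FA": 0.0, "SWA": -0.1, "KDA": -0.2, "GDN": -0.3}
    toy_site = [dict(layer_scores) for _ in range(3)]
    toy_cost = {"FA": 4, "SWA": 3, "KDA": 2, "GDN": 1}
    for budget in (12, 8, 3):
        print(budget, best_placement(toy_site, {}, toy_cost, budget))
```

The state only needs the last mixer and the cost used so far because, under the nearest-neighbour assumption, later terms depend on nothing else. A surrogate with longer-range clusters would need a larger state or a different search.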

Best for: MLOps Engineer, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.LG updates on arXiv.org.