Super Apriel: One Checkpoint, Many Speeds
Summary
ServiceNow AI has released "Super Apriel," a 15B-parameter supernet in which every decoder layer has four trained mixer choices that can be switched at runtime: Full Attention (FA), Sliding Window Attention (SWA), Kimi Delta Attention (KDA), and Gated DeltaNet (GDN). A single checkpoint therefore serves multiple speed presets, and it supports speculative decoding without a separate draft model. The all-FA preset matches the Apriel 1.6 teacher on benchmarks, while the recommended hybrid presets range from 2.9x decode throughput at 96% quality retention to 10.7x at 77%, with the throughput advantage growing at longer context lengths. Because the number of possible per-layer configurations is very large, a cluster-expansion surrogate model is used to search for the best speed-quality tradeoffs. The supernet is trained via stochastic distillation from a frozen Apriel 1.6 teacher, followed by supervised fine-tuning. The release is open source and includes the weights, training code (Fast-LLM), vLLM serving code, and a placement optimization toolkit.
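To make "one mixer choice per decoder layer" concrete, here is a minimal sketch of how a placement could be represented and how quickly the configuration space grows. The helper names (`make_placement`, `num_placements`) and the 8-layer depth are hypothetical and chosen only for illustration; they are not taken from the Super Apriel release.

```python
from collections import Counter
from typing import Dict, List, Optional

# The four mixer types named in the summary.
MIXERS = ("FA", "SWA", "KDA", "GDN")


def num_placements(num_layers: int) -> int:
    """Count distinct placements when each layer picks one of the four mixers."""
    return len(MIXERS) ** num_layers


def make_placement(
    num_layers: int,
    default: str = "FA",
    overrides: Optional[Dict[int, str]] = None,
) -> List[str]:
    """Build a placement: a list holding one mixer name per decoder layer."""
    placement = [default] * num_layers
    for layer, mixer in (overrides or {}).items():
        if mixer not in MIXERS:
            raise ValueError(f"unknown mixer: {mixer}")
        placement[layer] = mixer
    return placement


NUM_LAYERS = 8  # hypothetical depth, not the real model's layer count

# The all-FA preset uses full attention everywhere.
all_fa = make_placement(NUM_LAYERS)

# A made-up hybrid: mostly GDN, with full attention at both ends and one SWA layer.
hybrid = make_placement(NUM_LAYERS, default="GDN", overrides={0: "FA", 3: "SWA", 7: "FA"})

print(num_placements(NUM_LAYERS))  # 4**8 = 65536
print(Counter(hybrid))
```

Even at this toy depth there are tens of thousands of placements, which is why exhaustive evaluation is impractical and a surrogate-guided search is useful.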
Key takeaway
For AI engineers deploying large language models, Super Apriel adds runtime flexibility for adapting to varying workload demands. From a single deployment you can switch between high-quality, lower-throughput configurations and aggressive, high-throughput presets, optimizing for a specific task or for real-time load. This removes the need to deploy or retrain a separate model for each speed tier, which streamlines MLOps workflows and reduces operational costs.
Key insights
Super Apriel offers dynamic, single-checkpoint control over LLM speed-quality tradeoffs via runtime mixer placement.
Principles
- Hybrid architectures balance speed and quality.
- Placement optimization is critical for efficiency.
- Small-scale findings may not transfer to large models.
Method
Super Apriel trains all four mixer types simultaneously via stochastic distillation, then uses a cluster-expansion surrogate with dynamic programming for cost-constrained placement optimization, followed by targeted supervised fine-tuning.
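The cost-constrained placement step can be sketched as a small dynamic program. This is an illustration under stated assumptions, not the released toolkit: it assumes the surrogate contains only per-layer terms and adjacent-layer pair terms (a truncated cluster expansion), and it uses made-up integer cost units per mixer. All function names, coefficients, and the 5-layer depth are hypothetical.

```python
import itertools
import random

MIXERS = ("FA", "SWA", "KDA", "GDN")
# Hypothetical integer cost units per mixer (higher = slower); not measured values.
COST = {"FA": 4, "SWA": 2, "KDA": 1, "GDN": 1}


def total_cost(placement):
    """Sum the assumed per-layer costs of a placement."""
    return sum(COST[m] for m in placement)


def surrogate_score(placement, single, pair):
    """Predicted quality: per-layer terms plus adjacent-layer interaction terms."""
    score = sum(single[i][m] for i, m in enumerate(placement))
    score += sum(pair[i][(a, b)] for i, (a, b) in enumerate(zip(placement, placement[1:])))
    return score


def best_placement(single, pair, budget):
    """Maximise the surrogate score subject to total cost <= budget.

    DP state: (mixer at the current layer, cost used so far) -> (best score, path).
    This is exact for this surrogate because the only coupling is between neighbours.
    """
    states = {(m, COST[m]): (single[0][m], [m]) for m in MIXERS if COST[m] <= budget}
    for i in range(1, len(single)):
        nxt = {}
        for (prev, used), (score, path) in states.items():
            for m in MIXERS:
                c = used + COST[m]
                if c > budget:
                    continue  # this extension would exceed the cost budget
                s = score + single[i][m] + pair[i - 1][(prev, m)]
                if (m, c) not in nxt or s > nxt[(m, c)][0]:
                    nxt[(m, c)] = (s, path + [m])
        states = nxt
    if not states:
        raise ValueError("budget too small for any placement")
    return max(states.values(), key=lambda t: t[0])


def brute_force(single, pair, budget):
    """Exhaustive cross-check, feasible only for tiny depths."""
    feasible = (
        p for p in itertools.product(MIXERS, repeat=len(single)) if total_cost(p) <= budget
    )
    return max(surrogate_score(p, single, pair) for p in feasible)


# Random toy coefficients for a hypothetical 5-layer model.
rng = random.Random(0)
NUM_LAYERS, BUDGET = 5, 10
single = [{m: rng.random() for m in MIXERS} for _ in range(NUM_LAYERS)]
pair = [
    {(a, b): 0.1 * rng.random() for a in MIXERS for b in MIXERS}
    for _ in range(NUM_LAYERS - 1)
]

score, placement = best_placement(single, pair, BUDGET)
print(placement, total_cost(placement), round(score, 3))
```

Sweeping `BUDGET` from the cheapest to the most expensive feasible value would trace out a speed-quality frontier under these assumptions. A surrogate with longer-range interaction terms would need a larger DP state or a different search method.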
In practice
- Use Super Apriel for flexible LLM deployment.
- Leverage dynamic placement for workload adaptation.
- Employ speculative decoding with the shared checkpoint.
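The speculative decoding idea in the last bullet can be illustrated with a toy greedy loop. In this sketch the "draft" and "target" are plain Python functions standing in for a fast preset and a high-quality preset of the same checkpoint; how the released vLLM serving code actually wires this up is not described in the summary, so every name here is hypothetical.

```python
from typing import Callable, List, Sequence

# A "model" here is any function mapping a token context to the next token.
NextToken = Callable[[Sequence[int]], int]


def greedy_decode(model: NextToken, prompt: List[int], num_new: int) -> List[int]:
    """Plain autoregressive greedy decoding, one token per model call."""
    out = list(prompt)
    for _ in range(num_new):
        out.append(model(out))
    return out


def speculative_decode(
    target: NextToken, draft: NextToken, prompt: List[int], num_new: int, k: int = 4
) -> List[int]:
    """Greedy speculative decoding: the draft proposes, the target verifies."""
    out = list(prompt)
    goal = len(prompt) + num_new
    while len(out) < goal:
        # 1) The draft proposes up to k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(out))):
            proposal.append(draft(out + proposal))
        # 2) The target checks every proposed position. A real system would do
        #    this in one batched forward pass; here it is a simple loop.
        for i, tok in enumerate(proposal):
            expected = target(out + proposal[:i])
            if tok != expected:
                # Keep the agreed prefix, then take the target's own token.
                out.extend(proposal[:i])
                out.append(expected)
                break
        else:
            out.extend(proposal)  # every proposed token was accepted
    return out


def toy_target(ctx: Sequence[int]) -> int:
    """Stand-in for the high-quality preset."""
    return (ctx[-1] * 3 + len(ctx)) % 11


def toy_draft(ctx: Sequence[int]) -> int:
    """Stand-in for a fast preset: agrees with the target except on some steps."""
    tok = toy_target(ctx)
    return (tok + 1) % 11 if len(ctx) % 5 == 0 else tok


prompt = [1, 2]
print(speculative_decode(toy_target, toy_draft, prompt, num_new=20))
print(greedy_decode(toy_target, prompt, num_new=20))
```

With greedy verification the output is identical to decoding with the target alone; the draft only affects how many target passes are needed. This sketch omits the extra "bonus" token that full implementations take from the target when every proposal is accepted.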
Topics
- Supernet Architecture
- Flexible LLM Deployment
- Token Mixers
- Placement Optimization
- Stochastic Distillation
Code references
- ServiceNow/Fast-LLM
- ServiceNow/pipeline-rl
- gkamradt/LLMTest_NeedleInAHaystack
- MoonshotAI/Kimi-Linear
- NVIDIA-NeMo/Skills
Best for: MLOps Engineer, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.LG updates on arXiv.org.