Making LLMs faster without sacrificing accuracy
Summary
Amazon Nova researchers have developed a new scaling law that integrates architectural choices directly into the Chinchilla framework, enabling the optimization of large language models (LLMs) for both accuracy and inference efficiency. This work, presented at ICLR 2026, addresses a gap in existing scaling laws by considering factors like hidden size, MLP-to-attention ratio, and grouped-query attention (GQA). By adjusting these architectural elements, the researchers identified "Surefire" models that match or exceed LLaMA-3.2 accuracy while improving throughput by up to 47% on H200 GPUs using SGLang. The study found that the optimal MLP-to-attention ratio for LLaMA-3.2-style models is around 1.0, significantly lower than the 4.8 ratio found in existing open-weight versions like LLaMA-3.2-1B, indicating current models often over-allocate parameters to MLP layers.
Key takeaway
AI engineers and MLOps teams optimizing LLM deployments should re-evaluate their model architectures, specifically the MLP-to-attention ratio and hidden size. Architectures like the "Surefire" models, which use an MLP-to-attention ratio around 1.0, can yield up to 47% higher throughput without accuracy loss, and the gains are consistent across A100/H200 GPUs and the vLLM/SGLang frameworks. This makes the approach a practical route to efficiency improvements in real-time AI applications.
Key insights
Architectural choices significantly affect LLM inference efficiency and can be tuned without sacrificing accuracy. Existing scaling laws do not account for these choices, a gap the new scaling law addresses.
Principles
- Optimal MLP-to-attention ratio for LLaMA-3.2-style models is ~1.0.
- Small-scale experiments reliably predict large-scale architectural outcomes.
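To make the ratio concrete, here is a minimal sketch that computes it from layer dimensions. It assumes the ratio means per-layer MLP parameters divided by per-layer attention parameters, with a gated (three-projection) MLP and no biases; the function name and the LLaMA-3.2-1B dimensions used below are assumptions taken from the publicly released model configuration, not from the article.

```python
def mlp_to_attention_ratio(hidden_size, intermediate_size, num_heads, num_kv_heads, head_dim):
    """Per-layer MLP parameter count divided by attention parameter count (biases ignored)."""
    # Gated MLP: gate, up, and down projections, each hidden_size x intermediate_size.
    mlp_params = 3 * hidden_size * intermediate_size
    # Query and output projections use all heads; key and value use only the KV heads.
    q_and_o = 2 * hidden_size * num_heads * head_dim
    k_and_v = 2 * hidden_size * num_kv_heads * head_dim
    return mlp_params / (q_and_o + k_and_v)


# Assumed LLaMA-3.2-1B dimensions: hidden 2048, MLP width 8192, 32 heads, 8 KV heads, head dim 64.
ratio = mlp_to_attention_ratio(2048, 8192, 32, 8, 64)
print(round(ratio, 2))  # 4.8
```

Under these assumptions the result matches the 4.8 figure quoted in the summary. Moving toward a ratio of 1.0 would mean a much narrower MLP, wider attention, or both.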
Method
The scaling law is derived in two stages: first, fit the standard Chinchilla law; then calibrate how architectural choices (hidden size, MLP-to-attention ratio, GQA) shift the optimal reference loss.
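As a rough illustration of the first stage only, the standard Chinchilla law models loss as an irreducible term plus two power-law terms, one in parameter count and one in training tokens. The sketch below assumes that form; the default coefficients are the commonly cited fit from the original Chinchilla paper and are used purely as placeholders, not as the values fitted in this work. The second-stage calibration is not spelled out in this summary, so it is deliberately left out.

```python
def chinchilla_loss(n_params, n_tokens, E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Chinchilla-style loss: L(N, D) = E + A / N**alpha + B / D**beta.

    The default coefficients are illustrative placeholders (assumption),
    not values reported by the article summarized here.
    """
    return E + A / n_params**alpha + B / n_tokens**beta


# Hypothetical comparison: a 1B and a 3B parameter model, both trained on 100B tokens.
small = chinchilla_loss(1e9, 1e11)
large = chinchilla_loss(3e9, 1e11)
print(f"1B params: {small:.3f}, 3B params: {large:.3f}")
```

In this form the predicted loss falls as either parameters or tokens grow, and never drops below the irreducible term `E`.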
In practice
- Adjust hidden size to reduce FLOPs and KV cache size.
- Optimize MLP-to-attention ratio to reduce memory bottlenecks.
- Use GQA to cut memory I/O costs during generation.
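The effect of GQA on the KV cache can be estimated with simple arithmetic. This is a minimal sketch, assuming one key tensor and one value tensor per layer stored at 2 bytes per value (for example fp16 or bf16); the function name and the example dimensions are hypothetical and chosen only for illustration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size=1, bytes_per_value=2):
    """Approximate KV cache size in bytes for one forward context."""
    # Factor of 2: one key tensor and one value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_value


# Hypothetical 16-layer model, head dim 64, 8192-token context.
full_heads = kv_cache_bytes(16, num_kv_heads=32, head_dim=64, seq_len=8192)  # one KV head per query head
grouped = kv_cache_bytes(16, num_kv_heads=8, head_dim=64, seq_len=8192)      # GQA with 8 KV heads

print(full_heads // 2**20, grouped // 2**20)  # 1024 256 (MiB)
```

Cutting the number of KV heads from 32 to 8 shrinks the cache by 4x in this example, which means less data to read from memory at every generated token.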
Topics
- LLM Scaling Laws
- Transformer Architecture
- Inference Throughput
- MLP-to-Attention Ratio
- Grouped-Query Attention
Best for: MLOps Engineer, AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by Amazon Science homepage.