Making LLMs faster without sacrificing accuracy

2026-05-15 · Source: Amazon Science homepage · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, medium

Summary

Amazon Nova researchers have developed a new scaling law that integrates architectural choices directly into the Chinchilla framework, enabling the optimization of large language models (LLMs) for both accuracy and inference efficiency. This work, presented at ICLR 2026, addresses a gap in existing scaling laws by considering factors like hidden size, MLP-to-attention ratio, and grouped-query attention (GQA). By adjusting these architectural elements, the researchers identified "Surefire" models that match or exceed LLaMA-3.2 accuracy while improving throughput by up to 47% on H200 GPUs using SGLang. The study found that the optimal MLP-to-attention ratio for LLaMA-3.2-style models is around 1.0, significantly lower than the 4.8 ratio found in existing open-weight versions like LLaMA-3.2-1B, indicating current models often over-allocate parameters to MLP layers.

Key takeaway

AI engineers and MLOps teams optimizing LLM deployments should re-evaluate their model architectures, specifically the MLP-to-attention ratio and hidden size. Architectures like the "Surefire" models, which use an MLP-to-attention ratio around 1.0, can yield up to 47% higher throughput without accuracy loss, and the gains hold across A100/H200 GPUs and the vLLM/SGLang serving frameworks. This offers substantial efficiency improvements for real-time AI applications.

Key insights

Architectural choices significantly affect LLM inference efficiency and can be tuned without sacrificing accuracy. Earlier scaling laws ignored this; the new scaling law addresses the gap.

Principles

Optimal MLP-to-attention ratio for LLaMA-3.2-style models is ~1.0.
Small-scale experiments reliably predict large-scale architectural outcomes.
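
To make the ratio concrete, here is a minimal sketch that counts per-layer parameters. It assumes the ratio means MLP parameters divided by attention parameters in one transformer layer, which this summary does not state explicitly. The LayerConfig class, the helper names, and the LLaMA-3.2-1B dimensions (taken from the publicly released model configuration) are illustrative assumptions, not code from the original work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerConfig:
    """Hypothetical container for one transformer layer's dimensions."""
    hidden_size: int
    intermediate_size: int
    num_heads: int
    num_kv_heads: int
    head_dim: int


def attention_params(cfg: LayerConfig) -> int:
    # Q and O projections use all query heads; K and V use only the KV heads (GQA).
    q = cfg.hidden_size * cfg.num_heads * cfg.head_dim
    kv = 2 * cfg.hidden_size * cfg.num_kv_heads * cfg.head_dim
    o = cfg.num_heads * cfg.head_dim * cfg.hidden_size
    return q + kv + o


def mlp_params(cfg: LayerConfig) -> int:
    # Gated (SwiGLU-style) MLP: gate, up, and down projections, biases ignored.
    return 3 * cfg.hidden_size * cfg.intermediate_size


def mlp_to_attention_ratio(cfg: LayerConfig) -> float:
    return mlp_params(cfg) / attention_params(cfg)


# Assumed LLaMA-3.2-1B layer dimensions from the public model config.
LLAMA_3_2_1B = LayerConfig(
    hidden_size=2048,
    intermediate_size=8192,
    num_heads=32,
    num_kv_heads=8,
    head_dim=64,
)

print(round(mlp_to_attention_ratio(LLAMA_3_2_1B), 2))  # 4.8
```

Under these assumptions the calculation reproduces the 4.8 figure quoted in the summary, which suggests the ratio is indeed a per-layer parameter ratio.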

Method

The method involves a two-stage scaling law deduction: first, fitting the standard Chinchilla law, then calibrating how architectural choices (hidden size, MLP-to-attention ratio, GQA) affect the optimal reference loss.
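
The two-stage structure can be sketched roughly as follows. Stage one uses the standard Chinchilla form, loss = E + A / N^alpha + B / D^beta, with the constants published by Hoffmann et al. (2022) standing in for a fresh fit. Stage two is a placeholder: this summary does not give the functional form of the architectural calibration, so the quadratic-in-log penalty, the calibrated_loss name, and the best_ratio and curvature parameters are all hypothetical and only show where such a term would enter.

```python
import math

# Stage 1: Chinchilla parametric loss. Constants are the published fit from
# Hoffmann et al. (2022), used here only as stand-ins for a real fit.
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28


def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Reference loss for a model with n_params trained on n_tokens."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA


def calibrated_loss(
    n_params: float,
    n_tokens: float,
    mlp_attn_ratio: float,
    best_ratio: float = 1.0,   # hypothetical optimum, echoing the ~1.0 finding
    curvature: float = 0.01,   # hypothetical sensitivity, not a fitted value
) -> float:
    """Stage 2 placeholder: scale the reference loss by an architecture term.

    The penalty is zero at best_ratio and grows as the ratio moves away from
    it. This shape is an assumption, not the calibration from the paper.
    """
    reference = chinchilla_loss(n_params, n_tokens)
    penalty = curvature * math.log(mlp_attn_ratio / best_ratio) ** 2
    return reference * (1.0 + penalty)


ref = chinchilla_loss(1e9, 1e11)
print(calibrated_loss(1e9, 1e11, mlp_attn_ratio=1.0) == ref)  # True
print(calibrated_loss(1e9, 1e11, mlp_attn_ratio=4.8) > ref)   # True
```

In a real workflow both the stage-one constants and the stage-two term would be fitted to small-scale training runs and then extrapolated to larger models.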

In practice

Adjust hidden size to reduce FLOPs and KV cache size.
Optimize MLP-to-attention ratio to reduce memory bottlenecks.
Use GQA to cut memory I/O costs during generation.
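
As a rough illustration of why GQA and head count matter at generation time, the sketch below estimates the KV cache footprint per token. The function name and the 16-bit cache assumption are illustrative, and the dimensions are the assumed LLaMA-3.2-1B values (16 layers, 8 KV heads, head dimension 64) compared with a hypothetical variant that keeps one KV head per query head.

```python
def kv_cache_bytes_per_token(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,  # assumes an fp16/bf16 cache
) -> int:
    """Bytes of KV cache stored for each generated or prompt token."""
    # Factor of 2 covers the separate key and value tensors in every layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# GQA with 8 KV heads versus full multi-head attention with 32 KV heads.
gqa = kv_cache_bytes_per_token(num_layers=16, num_kv_heads=8, head_dim=64)
mha = kv_cache_bytes_per_token(num_layers=16, num_kv_heads=32, head_dim=64)

print(gqa)         # 32768
print(mha)         # 131072
print(mha // gqa)  # 4
```

Because every decoding step reads the whole cache, a 4x smaller cache means roughly 4x less KV data moved per step in this simplified model. Reducing the hidden size has a similar effect when it reduces the number of KV heads or the head dimension.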

Topics

LLM Scaling Laws
Transformer Architecture
Inference Throughput
MLP-to-Attention Ratio
Grouped-Query Attention

Best for: MLOps Engineer, AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by Amazon Science homepage.