Gemma 4 is not your standard transformer
Summary
Gemma 4, an open-weight model, departs from the standard transformer design in several ways. It replaces the traditional `1/sqrt(d_head)` attention scaling with QK-norm: queries and keys are RMS-normalized before the dot product, and the norm's learned per-dimension weights set the scale of the logits. Its hybrid attention stack uses five local (sliding-window) layers for every one global (full-context) layer; the global layers apply partial RoPE to only the first 25% of dimension pairs and use a higher `theta` base for long-range context. Per-layer input gating gives each layer a direct, independent signal derived from the original input token identity, bypassing the residual stream. The model also shares KV across the last several layers to reduce KV cache memory, and it runs a Mixture-of-Experts (MoE) block in parallel with a full dense MLP, so every token receives a dense feed-forward pass regardless of routing.
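The 5:1 hybrid layout summarized above can be sketched in plain Python. This is an illustration under stated assumptions, not Gemma 4's actual configuration: the position of the global layer inside each group of six, the window size of 3, and the 12-layer depth are made up for the example, since the summary states only the ratio.

```python
def layer_kinds(num_layers, local_per_global=5):
    """Label each layer 'local' or 'global': five local layers, then one global.

    Placing the global layer last in each group is an assumption.
    """
    period = local_per_global + 1
    return ["global" if (i + 1) % period == 0 else "local" for i in range(num_layers)]


def causal_mask(seq_len, window=None):
    """mask[q][k] is True when query position q may attend to key position k.

    window=None gives full causal attention (a global layer).
    An integer window limits each query to its most recent `window`
    positions, itself included (a local, sliding-window layer).
    """
    mask = []
    for q in range(seq_len):
        row = []
        for k in range(seq_len):
            visible = k <= q
            if window is not None:
                visible = visible and (q - k) < window
            row.append(visible)
        mask.append(row)
    return mask


kinds = layer_kinds(12)               # hypothetical 12-layer stack
local_mask = causal_mask(6, window=3)  # hypothetical window of 3 tokens
global_mask = causal_mask(6)
```

In this sketch the last query of a local layer sees only three positions, while the same query in a global layer sees the whole prefix, which is why the local layers keep attention cost bounded as context grows.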
Key takeaway
For AI engineers designing large language models, Gemma 4's architectural choices point to areas for innovation beyond the standard transformer design. Two are worth investigating in particular: QK-norm for attention stability, and per-layer input gating for preserving token identity, especially when scaling to very deep models or extended context windows. These departures from convention suggest that re-evaluating long-held assumptions may yield performance or efficiency gains.
Key insights
Gemma 4's architectural deviations suggest a re-evaluation of core transformer assumptions, particularly regarding token information flow.
Principles
- Learned scaling can replace fixed attention scaling.
- Positional encoding can be partially applied for long contexts.
- Token identity may need explicit preservation across layers.
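The second principle, partially applied positional encoding, can be illustrated with a small RoPE sketch that rotates only the first 25% of dimension pairs and leaves the rest untouched. Several details here are assumptions rather than facts from the article: the interleaved pairing `(x[2i], x[2i+1])`, the frequency formula using the full head dimension, and the `theta` value of 1,000,000 (the summary says only that the base is higher for global layers).

```python
import math


def partial_rope(x, position, rotary_fraction=0.25, theta=1_000_000.0):
    """Apply RoPE to the first `rotary_fraction` of dimension pairs of one head vector.

    x: list of floats with even length (one head, one token).
    The remaining pairs carry no positional rotation at all.
    """
    d = len(x)
    num_pairs = d // 2
    rotated_pairs = int(num_pairs * rotary_fraction)
    out = list(x)
    for i in range(rotated_pairs):
        # Standard RoPE frequency schedule; using the full head size d here is an assumption.
        freq = theta ** (-2.0 * i / d)
        angle = position * freq
        c, s = math.cos(angle), math.sin(angle)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out


head = [1.0] * 16                        # hypothetical 16-dim head: 8 pairs, 2 rotated
rotated = partial_rope(head, position=3)
```

Because each rotated pair is turned by a plain 2D rotation, the vector's norm is unchanged, and the unrotated dimensions compare tokens by content alone regardless of distance.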
Method
Gemma 4 employs QK-norm for attention scaling, partial RoPE in its global layers, per-layer input gating to preserve token identity, KV sharing across the last several layers to shrink the KV cache, and an MoE block that runs in parallel with a full dense MLP.
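The parallel MoE and dense MLP arrangement can be sketched for a single token as below. This is a toy illustration, not the model's implementation: the article says only that the two branches run in parallel, so summing their outputs, the top-k softmax routing, and the stand-in "experts" that merely scale their input are all assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def parallel_ffn(x, dense_mlp, experts, router_scores, top_k=2):
    """Feed-forward block with a dense branch and a routed MoE branch.

    x: token vector (list of floats)
    dense_mlp: callable run for every token, whatever the router decides
    experts: list of callables; only the top_k by router score are run
    router_scores: one raw score per expert for this token
    """
    dense_out = dense_mlp(x)

    # Pick the top_k experts and renormalize their scores into weights.
    top = sorted(range(len(experts)), key=lambda i: router_scores[i], reverse=True)[:top_k]
    weights = softmax([router_scores[i] for i in top])

    moe_out = [0.0] * len(x)
    for w, idx in zip(weights, top):
        expert_out = experts[idx](x)
        moe_out = [m + w * e for m, e in zip(moe_out, expert_out)]

    # Combining the branches by addition is an assumption.
    return [d + m for d, m in zip(dense_out, moe_out)]


def scale(factor):
    """Hypothetical stand-in for an MLP: multiplies every element by `factor`."""
    return lambda v: [factor * t for t in v]


out = parallel_ffn(
    [1.0, 2.0],
    dense_mlp=scale(0.5),
    experts=[scale(1.0), scale(2.0), scale(3.0)],
    router_scores=[0.0, 5.0, 5.0],
    top_k=2,
)
```

The point of the layout is visible in the function body: `dense_mlp` is called unconditionally, so a token that is poorly served by its routed experts still gets a full feed-forward pass.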
In practice
- Implement QK-norm with RMSNorm for flexible attention scaling.
- Consider partial RoPE for models with very long context windows.
- Explore per-layer input gating to reinforce token identity.
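For the first item in the list above, a minimal QK-norm sketch in plain Python might look like this. It assumes a standard RMSNorm with a learned per-dimension `scale` vector and an `eps` of 1e-6; the function names and the toy inputs are hypothetical, and a real implementation would operate on batched tensors per attention head.

```python
import math


def rms_norm(x, scale, eps=1e-6):
    """RMS-normalize x, then apply a learned per-dimension scale."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [s * v / rms for v, s in zip(x, scale)]


def qk_norm_scores(queries, keys, q_scale, k_scale):
    """Attention logits from normalized queries and keys.

    Note there is no 1/sqrt(d_head) factor: the magnitude of the logits
    is governed by the learned scales instead.
    """
    qn = [rms_norm(q, q_scale) for q in queries]
    kn = [rms_norm(k, k_scale) for k in keys]
    return [[sum(a * b for a, b in zip(q, k)) for k in kn] for q in qn]


ones = [1.0, 1.0, 1.0, 1.0]
queries = [[1.0, 2.0, 3.0, 4.0]]
keys = [[4.0, 3.0, 2.0, 1.0], [1.0, 1.0, 1.0, 1.0]]

scores = qk_norm_scores(queries, keys, ones, ones)

# Blowing up the raw query magnitude barely moves the logits.
big_queries = [[100.0 * v for v in q] for q in queries]
big_scores = qk_norm_scores(big_queries, keys, ones, ones)
```

With unit scales, each logit is bounded in magnitude by `d_head` no matter how large the raw activations grow, which is the stability property the takeaway refers to.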
Topics
- QK-norm
- Partial RoPE
- Per-layer Input Gating
- KV Sharing
- Mixture-of-Experts
Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by AI Advances - Medium.