Gemma 4 is not your standard transformer

· Source: AI Advances - Medium · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, long

Summary

Gemma 4, an open-weight model, introduces several non-standard architectural choices that challenge conventional transformer design. Key modifications include replacing the traditional `1/sqrt(d_head)` attention scaling with QK-norm, where queries and keys are RMS-normalized before the dot product, allowing learned per-dimension scaling. Its hybrid attention stack uses five local (sliding-window) layers for every one global (full-context) layer, with global layers employing partial RoPE on only the first 25% of dimension pairs and a higher `theta` base for long-range context. Additionally, Gemma 4 incorporates per-layer input gating, providing each layer a direct, independent signal derived from the original input token identity, bypassing the residual stream. The model also features KV sharing across the last several layers to reduce KV cache memory and utilizes a Mixture-of-Experts (MoE) block that runs in parallel with a full dense MLP, ensuring every token receives a dense feed-forward pass regardless of routing.
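The 5:1 layer pattern and partial RoPE described above can be sketched in a few lines of plain Python. This is a minimal illustration, not Gemma 4's actual code: the function names, the placement of the global layer at the end of each block of six, the adjacent-pair layout, the frequency formula, and the default `theta` value are all assumptions made for the example. Only the 5:1 ratio and the 25% rotary fraction come from the summary.

```python
import math

def layer_kinds(num_layers, local_per_global=5):
    """Label each layer 'local' or 'global': five local layers, then one global.

    Placing the global layer last in each block of six is an assumption.
    """
    period = local_per_global + 1
    return ["global" if (i + 1) % period == 0 else "local"
            for i in range(num_layers)]

def partial_rope(x, position, rotary_fraction=0.25, theta=1_000_000.0):
    """Rotate only the first `rotary_fraction` of dimension pairs of `x`.

    `x` is one head vector (a list of floats, even length). The remaining
    pairs pass through unchanged, so they carry no positional signal.
    The `theta` default is a placeholder, not a value from the source.
    """
    d = len(x)
    assert d % 2 == 0, "head dimension must be even"
    num_pairs = d // 2
    rotated_pairs = int(num_pairs * rotary_fraction)
    out = list(x)
    for p in range(rotated_pairs):
        # Standard RoPE-style frequency for pair p (assumed formula).
        angle = position * theta ** (-2.0 * p / d)
        c, s = math.cos(angle), math.sin(angle)
        a, b = x[2 * p], x[2 * p + 1]
        out[2 * p] = a * c - b * s
        out[2 * p + 1] = a * s + b * c
    return out

if __name__ == "__main__":
    print(layer_kinds(12))
    print(partial_rope([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0], position=3))
```

With an 8-dimensional head there are four pairs, so only the first pair is rotated and the last six values come back untouched.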

Key takeaway

For AI Engineers designing large language models, Gemma 4's architectural choices highlight potential areas for innovation beyond standard transformer designs. You should investigate the implications of QK-norm for attention stability and the utility of per-layer input gating for preserving token identity, especially when scaling to very deep models or extended context windows. These departures from convention suggest that re-evaluating long-held assumptions can yield significant performance or efficiency gains.
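One way to start probing the stability question is to compare how the two scaling schemes react when a query grows in magnitude. The sketch below is a hedged illustration in plain Python: `rms_norm`, `qk_norm_logit`, and `scaled_dot_logit` are hypothetical helper names, and the learned per-dimension scales are represented as ordinary lists rather than trained parameters.

```python
import math

def rms_norm(vec, scale, eps=1e-6):
    """RMS-normalize `vec`, then apply a per-dimension scale."""
    mean_sq = sum(v * v for v in vec) / len(vec)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * g for v, g in zip(vec, scale)]

def qk_norm_logit(q, k, q_scale, k_scale):
    """Attention logit with QK-norm: normalize q and k, then take the dot
    product with no 1/sqrt(d_head) factor."""
    qn = rms_norm(q, q_scale)
    kn = rms_norm(k, k_scale)
    return sum(a * b for a, b in zip(qn, kn))

def scaled_dot_logit(q, k):
    """Conventional logit: dot product divided by sqrt(d_head)."""
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))

if __name__ == "__main__":
    q, k = [1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0]
    ones = [1.0] * len(q)
    big_q = [100.0 * v for v in q]
    # The conventional logit grows with the query norm ...
    print(scaled_dot_logit(q, k), scaled_dot_logit(big_q, k))
    # ... while the QK-norm logit is essentially unchanged.
    print(qk_norm_logit(q, k, ones, ones), qk_norm_logit(big_q, k, ones, ones))
```

With unit scales, each normalized vector has norm at most `sqrt(d_head)`, so the logit magnitude cannot exceed `d_head` no matter how large the raw activations become. The learned scales then control how sharp the attention is allowed to get.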

Key insights

Gemma 4's architectural deviations suggest a re-evaluation of core transformer assumptions, particularly regarding token information flow.

Method

Gemma 4 employs QK-norm for attention scaling, partial RoPE for global layers, per-layer input gating for token identity, and KV sharing, alongside a parallel MoE and dense MLP.
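The parallel MoE and dense MLP arrangement can be illustrated with a toy feed-forward block. This is a sketch under stated assumptions: the summary only says the two paths run in parallel and that every token gets a dense pass, so the top-k softmax routing, the additive combination of the two outputs, and the name `parallel_ffn` are illustrative choices, not confirmed details of Gemma 4.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def parallel_ffn(x, dense_mlp, experts, router_logits, top_k=2):
    """Dense MLP and routed experts applied to the same input, then summed.

    `dense_mlp` and each entry of `experts` are callables mapping a list of
    floats to a list of the same length. `router_logits` holds one score per
    expert for this token.
    """
    # Dense path: always runs, regardless of routing.
    dense_out = dense_mlp(x)

    # Sparse path: keep the top_k experts and renormalize their scores.
    top = sorted(range(len(experts)),
                 key=lambda i: router_logits[i], reverse=True)[:top_k]
    weights = softmax([router_logits[i] for i in top])
    moe_out = [0.0] * len(x)
    for w, i in zip(weights, top):
        expert_out = experts[i](x)
        moe_out = [m + w * v for m, v in zip(moe_out, expert_out)]

    # Assumed combination rule: simple addition of the two paths.
    return [d + m for d, m in zip(dense_out, moe_out)]

if __name__ == "__main__":
    dense = lambda x: [2.0 * v for v in x]
    experts = [
        lambda x: [v + 1.0 for v in x],
        lambda x: [0.0 for _ in x],
        lambda x: [-v for v in x],
    ]
    print(parallel_ffn([1.0, 2.0], dense, experts, [0.0, 0.0, -10.0]))
```

Even if the router sends a token to experts that contribute little, the dense path still contributes its full output, which is the property the summary highlights.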

Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by AI Advances - Medium.