Google Shrank Gemma 4 by 72% and Unsloth Fixed the 4-Bit Bug Nobody Else Caught on One 4090, and…

2026-06-08 · Source: Towards AI - Medium · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Advanced, quick

Summary

Google DeepMind recently released Quantization-Aware Training (QAT) checkpoints for its entire Gemma 4 family: E2B, E4B, 12B, the 26B-A4B mixture-of-experts model, and the 31B dense model. The QAT checkpoints cut model size by 72%, so the 26-billion-parameter model fits in 15GB of memory and reaches 193 tokens per second on a single consumer GPU. Shortly after, Unsloth re-converted the checkpoints into GGUF files and corrected a subtle 4-bit bug; the corrected 26B model scores 15 points higher in accuracy than naive conversions. Together, the two releases show that 4-bit quantized models can retain near full-precision performance, challenging earlier assumptions about quantization quality.
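
The size figures in the summary can be sanity-checked with a little arithmetic. The sketch below works backwards from the 15GB and 72% figures to the baseline size they imply. It assumes the full-precision baseline stores 2 bytes per parameter (bf16) and that 1GB is 10^9 bytes; the summary states neither, and the function names are illustrative.

```python
def size_reduction(original_gb: float, quantized_gb: float) -> float:
    """Fraction of the original size removed by quantization."""
    if original_gb <= 0:
        raise ValueError("original_gb must be positive")
    return 1.0 - quantized_gb / original_gb


def implied_original_gb(quantized_gb: float, reduction: float) -> float:
    """Original size implied by a quantized size and a stated reduction."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return quantized_gb / (1.0 - reduction)


# Figures from the summary: 15GB after a 72% reduction.
baseline_gb = implied_original_gb(15.0, 0.72)  # roughly 53.6 GB

# Assumption: bf16 baseline, 2 bytes per parameter, 1GB = 1e9 bytes,
# so the parameter count in billions is simply GB / 2.
implied_params_billion = baseline_gb / 2.0  # roughly 26.8 billion
```

Under those assumptions the implied parameter count lands close to the model's nominal 26B, so the 15GB and 72% figures are at least mutually consistent.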

Key takeaway

For AI engineers optimizing large language models for local deployment, this release fundamentally shifts expectations for 4-bit quantization. You can now achieve near full-precision performance with a much smaller memory footprint, which makes 26-billion-parameter models practical on consumer-grade GPUs. Prioritize evaluating the Gemma 4 QAT checkpoints, especially Unsloth's GGUF conversions, to deploy high-quality models on cost-effective hardware without sacrificing accuracy.

Key insights

Google's Gemma 4 QAT and Unsloth's fix enable 4-bit quantized models to achieve near full-precision performance on consumer hardware.

Principles

4-bit quantization can retain near full-precision accuracy.
Quantization-Aware Training (QAT) is key for high-quality 4-bit models.
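
To make the QAT principle concrete, here is a minimal sketch of the "fake quantization" step commonly used in quantization-aware training: weights are rounded to a 4-bit grid and mapped back to floats in the forward pass, so the model learns to tolerate the rounding. This is a generic symmetric per-group scheme in plain Python, chosen for illustration. It is an assumption, not the recipe Google used for Gemma 4, and it omits the training loop and gradient handling.

```python
from typing import List

QMIN, QMAX = -8, 7  # signed 4-bit integer range


def fake_quantize_int4(weights: List[float]) -> List[float]:
    """Round one group of weights to a symmetric 4-bit grid, then dequantize."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    if max_abs == 0.0:
        return [0.0 for _ in weights]
    scale = max_abs / QMAX  # one scale shared by the whole group
    dequantized = []
    for w in weights:
        q = round(w / scale)         # nearest integer level
        q = max(QMIN, min(QMAX, q))  # clamp to the 4-bit range
        dequantized.append(q * scale)
    return dequantized


group = [0.70, -0.35, 0.12, -0.01]  # illustrative weights
dequantized = fake_quantize_int4(group)
max_error = max(abs(a - b) for a, b in zip(group, dequantized))
```

In this scheme the per-weight rounding error is bounded by half the group scale. Quantization-aware training keeps that error in the loop during training instead of applying it once afterwards.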

Method

Google DeepMind shipped Quantization-Aware Training checkpoints for Gemma 4 models; Unsloth then re-converted them into GGUFs, fixing a subtle 4-bit bug for improved accuracy.

In practice

Run the 26B model on a single consumer GPU; the QAT checkpoint fits in 15GB of memory.
Utilize Unsloth's GGUF conversions for Gemma 4 models.

Topics

Gemma 4
Quantization-Aware Training
4-bit Quantization
Large Language Models
GGUF
Unsloth
Consumer GPUs

Best for: Machine Learning Engineer, AI Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Towards AI - Medium.