Google Shrank Gemma 4 by 72% and Unsloth Fixed the 4-Bit Bug Nobody Else Caught on One 4090, and…
Summary
Google DeepMind recently released Quantization-Aware Training (QAT) checkpoints for its entire Gemma 4 family: E2B, E4B, 12B, the 26B-A4B mixture-of-experts, and the 31B dense model. With these checkpoints, the 26-billion-parameter model shrinks by 72% to fit in 15GB of memory and reaches 193 tokens per second on a single consumer GPU. Shortly after, Unsloth re-converted the checkpoints into GGUF files and corrected a subtle 4-bit bug; the fix raised the 26B model's accuracy by 15 points over naive conversions. Together, the two releases show that 4-bit quantized models can retain near full-precision performance, challenging earlier assumptions about quantization quality.
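The size figures above can be sanity-checked with a back-of-envelope calculation. The sketch below assumes "26B" means exactly 26e9 parameters and that the reduction is measured against a 16-bit (bf16) baseline; neither assumption is stated in the source, and the function names are illustrative only.

```python
def weight_footprint_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


def effective_bits_per_weight(num_params: float, size_gb: float) -> float:
    """Average number of bits stored per parameter for a given total size."""
    return size_gb * 1e9 * 8 / num_params


PARAMS = 26e9        # assumption: "26B" read as exactly 26e9 parameters
QAT_SIZE_GB = 15.0   # figure quoted in the summary
BASELINE_BITS = 16   # assumption: bf16 baseline, not stated in the source

baseline_gb = weight_footprint_gb(PARAMS, BASELINE_BITS)
reduction = 1 - QAT_SIZE_GB / baseline_gb
bits = effective_bits_per_weight(PARAMS, QAT_SIZE_GB)

print(f"baseline: {baseline_gb:.0f} GB")          # baseline: 52 GB
print(f"reduction: {reduction:.1%}")              # reduction: 71.2%
print(f"effective bits per weight: {bits:.2f}")   # effective bits per weight: 4.62
```

Under these assumptions the reduction comes out near 71%, in line with the quoted 72%; the small gap would be expected from the exact parameter count and the baseline the original article used. An average above 4 bits per weight is also plausible, since 4-bit formats typically store scales alongside the weights and keep some tensors at higher precision.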
Key takeaway
For AI engineers optimizing large language models for local deployment, this release shifts expectations for 4-bit quantization. Near full-precision performance is now achievable at a much smaller memory footprint, which puts a 26-billion-parameter model within reach of consumer-grade GPUs. Prioritize evaluating the Gemma 4 QAT checkpoints, especially Unsloth's GGUF conversions, to deploy high-quality models on cost-effective hardware with little loss of accuracy.
Key insights
Google's Gemma 4 QAT and Unsloth's fix enable 4-bit quantized models to achieve near full-precision performance on consumer hardware.
Principles
- 4-bit quantization can retain near full-precision accuracy.
- Quantization-Aware Training (QAT) is key for high-quality 4-bit models.
Method
Google DeepMind shipped Quantization-Aware Training checkpoints for Gemma 4 models; Unsloth then re-converted them into GGUFs, fixing a subtle 4-bit bug for improved accuracy.
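The source does not describe Google's training recipe, so the following is only a generic illustration of the idea behind QAT: during training, the forward pass sees weights that have been rounded to the 4-bit grid and mapped back to floats ("fake quantization"), while gradients update the full-precision copy. The sketch assumes a symmetric, per-tensor scale; real 4-bit formats usually use per-block scales, and the function name is hypothetical.

```python
def fake_quantize_int4(weights: list[float]) -> list[float]:
    """Round weights to a signed 4-bit grid and map them back to floats.

    Symmetric, per-tensor scheme (an assumption for illustration):
    the largest magnitude maps to the integer 7, and integers are
    clamped to the signed 4-bit range [-8, 7].
    """
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return [0.0 for _ in weights]
    scale = max_abs / 7
    dequantized = []
    for w in weights:
        q = round(w / scale)          # nearest point on the integer grid
        q = max(-8, min(7, q))        # clamp to the signed 4-bit range
        dequantized.append(q * scale) # back to float for the forward pass
    return dequantized


weights = [0.7, -0.32, 0.12, 0.0]
print(fake_quantize_int4(weights))
```

Because the model is trained against these rounded values, it can learn weights that tolerate the rounding error, which is why a QAT checkpoint can lose less accuracy at 4 bits than a model quantized only after training.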
In practice
- Run the 26B model on a single consumer GPU; the QAT checkpoint fits in about 15GB of memory.
- Utilize Unsloth's GGUF conversions for Gemma 4 models.
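Before loading a downloaded conversion, it can help to confirm that the file is a well-formed GGUF. The sketch below reads only the fixed-size header that begins every GGUF file (magic bytes, format version, tensor count, metadata key-value count). It assumes a little-endian file at GGUF version 2 or later; the class and function names are hypothetical and not part of any library.

```python
import struct
from typing import NamedTuple


class GGUFHeader(NamedTuple):
    version: int
    tensor_count: int
    metadata_kv_count: int


def read_gguf_header(path: str) -> GGUFHeader:
    """Read the 24-byte header at the start of a GGUF file.

    Layout (little-endian, GGUF v2+): 4-byte magic b"GGUF",
    uint32 version, uint64 tensor count, uint64 metadata KV count.
    """
    with open(path, "rb") as f:
        raw = f.read(24)
    if len(raw) < 24 or raw[:4] != b"GGUF":
        raise ValueError(f"{path} does not look like a GGUF file")
    version, tensor_count, kv_count = struct.unpack("<IQQ", raw[4:24])
    return GGUFHeader(version, tensor_count, kv_count)
```

Calling `read_gguf_header("model.gguf")`, where `model.gguf` is a placeholder path, returns the header fields or raises `ValueError` for a truncated or mislabeled download. This checks the container only; it says nothing about the quality of the quantization inside.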
Topics
- Gemma 4
- Quantization-Aware Training
- 4-bit Quantization
- Large Language Models
- GGUF
- Unsloth
- Consumer GPUs
Best for: Machine Learning Engineer, AI Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Towards AI - Medium.