New DiffusionGemma and MoQ GGUFs for Gemma 4 12B and LFM2.5 8B A1B
Summary
Google DeepMind introduced DiffusionGemma, an experimental text-diffusion variant of Gemma 4 26B A4B designed for faster text generation. The model is a Mixture-of-Experts architecture with 25.2B total parameters, of which 3.8B are active. Rather than generating token by token, it produces text in 256-token blocks by repeatedly denoising a canvas. It achieves up to 4x faster generation, with speeds exceeding 1100 tokens/s on an H100 in FP8 in low-batch settings, which makes it attractive for local inference. The trade-off is accuracy: DiffusionGemma significantly underperforms the original 26B model. Separately, new Mixture-of-Quantization (MoQ) GGUFs have been released for Gemma 4 12B IT and LFM2.5 8B A1B. Among sub-7 GB models, they outperform other GGUFs, including APEX.
Key takeaway
If you are a machine learning engineer optimizing local inference, evaluate DiffusionGemma for tasks that prioritize generation speed over accuracy, especially editing or structured text, where its up-to-4x speed advantage could matter most. If you instead need small, efficient models, try the new MoQ GGUFs for Gemma 4 12B IT or LFM2.5 8B A1B, which outperform other GGUFs, including APEX, among sub-7 GB models.
Key insights
DiffusionGemma offers speed over accuracy, while MoQ GGUFs enhance quantized model performance.
Principles
- Diffusion models can accelerate text generation.
- Denoising a whole block in parallel, rather than decoding token by token, gives the GPU more parallel work and improves inference speed.
- MoQ GGUFs outperform other GGUFs, including APEX, for sub-7 GB models.
Method
DiffusionGemma uses a block-autoregressive discrete diffusion process that iteratively denoises a 256-token canvas. At each step, it predicts tokens, estimates their uncertainty, keeps the confident positions, and re-noises the uncertain ones using Entropy-Bounded Denoising.
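The loop described above can be illustrated with a toy sketch. This is not DiffusionGemma's implementation: the denoiser is a random stand-in, and the names `MASK`, `toy_denoiser`, `denoise_block`, and the `max_entropy` threshold are hypothetical. The exact rule used by Entropy-Bounded Denoising is not given here, so the sketch assumes a simple per-position entropy threshold, plus a forced commit so the loop always terminates.

```python
import numpy as np

MASK = -1  # hypothetical sentinel marking a noised (undecided) position


def toy_denoiser(canvas, vocab_size, rng):
    """Stand-in for the real model: random per-position token distributions."""
    logits = rng.normal(scale=4.0, size=(canvas.shape[0], vocab_size))
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)


def denoise_block(block_len=256, vocab_size=32, max_steps=8,
                  max_entropy=1.5, seed=0):
    """Fill one block by repeated predict / keep-confident / re-noise passes."""
    rng = np.random.default_rng(seed)
    canvas = np.full(block_len, MASK, dtype=np.int64)  # fully noised canvas
    steps = 0
    for step in range(max_steps):
        masked = canvas == MASK
        if not masked.any():
            break  # every position has been committed
        steps += 1
        probs = toy_denoiser(canvas, vocab_size, rng)
        pred = probs.argmax(axis=1)                            # predict tokens
        ent = -(probs * np.log(probs + 1e-12)).sum(axis=1)     # uncertainty
        # Assumed rule: a position is "confident" if its entropy is low enough.
        confident = masked & (ent <= max_entropy)
        if step == max_steps - 1:
            confident = masked  # final pass: commit whatever is left
        elif not confident.any():
            # Guarantee progress: commit the single least uncertain position.
            idx = np.where(masked)[0]
            confident = np.zeros_like(masked)
            confident[idx[ent[idx].argmin()]] = True
        canvas[confident] = pred[confident]  # keep confident positions
        # Positions not committed stay MASK, i.e. they are re-noised.
    return canvas, steps


canvas, steps = denoise_block()
print(f"filled {canvas.size} positions in {steps} denoising steps")
```

The point of the sketch is that the number of model passes depends on `max_steps` and the confidence rule, not on the block length, which is where a speed-up over token-by-token decoding would come from. A block-autoregressive generator would call such a routine once per block and condition each block on the previous ones; the toy denoiser here ignores context entirely.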
In practice
- Use DiffusionGemma for speed-critical local inference.
- Consider MoQ GGUFs for efficient sub-7 GB models.
- Apply diffusion for editing or structured text tasks.
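To put the sub-7 GB budget in perspective, a rough calculation gives the average number of bits per weight that fits in a file of that size. This is a back-of-the-envelope sketch, not a description of how MoQ allocates precision. It assumes 1 GB = 10^9 bytes, takes the nominal parameter counts from the model names (12B and 8B), and treats the whole file as weights, so real figures would be somewhat lower. The function name `max_bits_per_weight` is hypothetical.

```python
def max_bits_per_weight(file_size_gb, n_params_billion):
    """Upper bound on average bits per weight if the entire file held weights.

    Assumes 1 GB = 1e9 bytes and ignores metadata and tokenizer data.
    """
    total_bits = file_size_gb * 1e9 * 8
    return total_bits / (n_params_billion * 1e9)


# Nominal parameter counts taken from the model names (an assumption).
for name, params_b in [("Gemma 4 12B IT", 12.0), ("LFM2.5 8B A1B", 8.0)]:
    bpw = max_bits_per_weight(7.0, params_b)
    print(f"{name}: at most {bpw:.2f} bits/weight in 7 GB")
```

Under these assumptions, a 12B model has to average below roughly 4.7 bits per weight to stay under 7 GB, while an 8B model has room for about 7.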
Topics
- Diffusion Models
- Large Language Models
- Gemma 4
- Model Quantization
- GGUF
- Inference Optimization
Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Hardware Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by The Kaitchup – AI on a Budget.