New DiffusionGemma and MoQ GGUFs for Gemma 4 12B and LFM2.5 8B A1B

2026-04-15 · Source: The Kaitchup – AI on a Budget · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure · Depth: Expert, quick

Summary

Google DeepMind introduced DiffusionGemma, an experimental text-diffusion variant of Gemma 4 26B A4B designed for faster text generation. The model uses a Mixture-of-Experts architecture with 25.2B total and 3.8B active parameters. Instead of generating token by token, it produces text in 256-token blocks by repeatedly denoising a canvas. Generation is reported to be up to 4x faster, exceeding 1100 tokens/s on an H100 in FP8 at low batch sizes, which makes it attractive for local inference. The trade-off is accuracy: DiffusionGemma significantly underperforms the original 26B model. Separately, new Mixture-of-Quantization (MoQ) GGUFs for Gemma 4 12B IT and LFM2.5 8B A1B have been released; among sub-7 GB models, they are reported to outperform other GGUFs, including APEX.

Key takeaway

Machine Learning Engineers optimizing local inference should evaluate DiffusionGemma for tasks where generation speed matters more than accuracy, especially editing or structured text, where its up-to-4x speed advantage could be decisive. If you instead need a small, efficient model, try the new MoQ GGUFs for Gemma 4 12B IT or LFM2.5 8B A1B, which are reported to outperform other GGUFs, including APEX, among sub-7 GB models.

Key insights

DiffusionGemma trades accuracy for speed, while MoQ GGUFs improve the quality of small quantized models.

Principles

Diffusion models can accelerate text generation.
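Denoising a whole block in parallel gives the GPU more work per step than token-by-token decoding, which improves inference speed.
Among sub-7 GB models, MoQ GGUFs outperform other GGUFs, including APEX.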

Method

DiffusionGemma uses a block-autoregressive discrete diffusion process, denoising a 256-token canvas iteratively. It predicts tokens, estimates uncertainty, keeps confident positions, and re-noises uncertain ones using Entropy-Bounded Denoising.
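
The loop described above can be sketched in a few lines of Python. This is a toy illustration only, not DiffusionGemma's implementation: `toy_denoiser` is a hypothetical stand-in for the model, `MASK`, `max_entropy`, and `max_steps` are invented for the example, and a simple per-position entropy threshold stands in for Entropy-Bounded Denoising, whose exact rule is not given here.

```python
import math
import random

MASK = -1  # hypothetical id marking a noised (not yet committed) position


def entropy(probs):
    """Shannon entropy in nats of one categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def toy_denoiser(n_positions, vocab_size, step, rng):
    """Stand-in for the model: one softmax distribution per open position.

    Assumption for the toy: predictions get sharper as `step` grows.
    """
    dists = []
    for _ in range(n_positions):
        logits = [rng.gauss(0.0, 1.0 + step) for _ in range(vocab_size)]
        top = max(logits)
        exps = [math.exp(x - top) for x in logits]
        total = sum(exps)
        dists.append([e / total for e in exps])
    return dists


def denoise_block(block_len=256, vocab_size=16, max_entropy=0.5,
                  max_steps=50, seed=0):
    """Fill one block by repeated predict / keep-confident / re-noise passes."""
    rng = random.Random(seed)
    canvas = [MASK] * block_len  # start from a fully noised canvas
    steps_used = 0
    for step in range(max_steps):
        open_positions = [i for i, tok in enumerate(canvas) if tok == MASK]
        if not open_positions:
            break
        steps_used += 1
        dists = toy_denoiser(len(open_positions), vocab_size, step, rng)
        last_step = step == max_steps - 1
        for i, probs in zip(open_positions, dists):
            guess = max(range(vocab_size), key=probs.__getitem__)
            # Keep low-entropy (confident) predictions; uncertain positions
            # stay MASK, i.e. they are re-noised for the next pass.
            # On the last allowed step everything is committed.
            if last_step or entropy(probs) <= max_entropy:
                canvas[i] = guess
    return canvas, steps_used


if __name__ == "__main__":
    block, steps = denoise_block()
    print(f"filled {len(block)} positions in {steps} denoising steps")
```

Each pass touches every open position at once, so the number of model calls depends on how quickly predictions become confident rather than on the block length. That is the source of the speedup over token-by-token decoding.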

In practice

Use DiffusionGemma for speed-critical local inference.
Consider MoQ GGUFs for efficient sub-7 GB models.
Apply diffusion for editing or structured text tasks.
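
To make the sub-7 GB target concrete, a back-of-the-envelope calculation shows the average bits per weight a quantized file can afford. This is a rough sketch under stated assumptions: parameter counts are taken as the nominal 12B and 8B from the model names, 1 GB is 10^9 bytes, and file metadata is ignored. `max_bits_per_weight` is a helper invented for this example.

```python
def max_bits_per_weight(size_budget_gb, n_params_billion):
    """Average bits per weight that fit in the size budget.

    Assumes 1 GB = 1e9 bytes and ignores GGUF metadata overhead.
    """
    return size_budget_gb * 8.0 / n_params_billion


if __name__ == "__main__":
    # Nominal parameter counts read off the model names (an assumption).
    for name, params_b in [("Gemma 4 12B IT", 12.0), ("LFM2.5 8B A1B", 8.0)]:
        bits = max_bits_per_weight(7.0, params_b)
        print(f"{name}: under {bits:.2f} bits/weight for a 7 GB file")
```

Under these assumptions, a 12B model has to average below roughly 4.7 bits per weight to stay under 7 GB, while an 8B model has room for about 7.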

Topics

Diffusion Models
Large Language Models
Gemma 4
Model Quantization
GGUF
Inference Optimization

Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Hardware Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by The Kaitchup – AI on a Budget.