Run DiffusionGemma on NVIDIA for Developer-Ready, High-Throughput Text Generation

2026-06-10 · Source: NVIDIA Technical Blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure, Software Development & Engineering · Depth: Intermediate, quick

Summary

DiffusionGemma, a new text generation model developed by Google DeepMind, introduces a novel diffusion-based denoising approach to produce tokens in parallel, significantly enhancing throughput for real-time AI applications. Optimized for NVIDIA platforms, it generates 256 tokens per step, achieving speeds up to 1,000 tokens/sec on a single NVIDIA H100 Tensor Core GPU and 150 tokens/sec on NVIDIA DGX Spark. Built on the Gemma 4 26B A4B MoE architecture, DiffusionGemma is designed for low-latency, memory-bound inference, reducing serving costs and improving responsiveness. Developers can access the model via Hugging Face Transformers, including BF16 and NVFP4 quantized checkpoints, and utilize NVIDIA GPU-accelerated endpoints for prototyping. For production, NVIDIA NIM offers containerized deployment with a standard OpenAI-compatible API, while NVIDIA NeMo AutoModel supports fine-tuning.

Key takeaway

For AI Engineers building real-time chat assistants or agentic workflows, DiffusionGemma offers a critical solution to token-by-token generation speed constraints. Its parallel token generation capability, delivering up to 1,000 tokens/sec on NVIDIA H100, directly translates to lower serving costs and more responsive user experiences. You should explore integrating DiffusionGemma via NVIDIA NIM for streamlined production deployment or utilize NeMo AutoModel for efficient fine-tuning to specific tasks.

Key insights

DiffusionGemma employs diffusion-based denoising to generate text tokens in parallel, achieving high throughput for real-time AI.

Principles

Diffusion-based denoising enables parallel token generation.
Hardware-specific optimization significantly boosts inference throughput.

Method

Deploy DiffusionGemma using NVIDIA NIM by downloading the container, starting the server, and sending requests via an OpenAI-compatible API. Fine-tune with NVIDIA NeMo AutoModel.

In practice

Prototype DiffusionGemma on Hugging Face Transformers with NVIDIA RTX 5090.
Use vLLM for high-throughput serving on DGX Spark/Station.
Adapt DiffusionGemma to tasks using NVIDIA NeMo AutoModel.

Topics

DiffusionGemma
Text Generation
Parallel Inference
NVIDIA GPUs
NVIDIA NIM
NeMo AutoModel

Code references

Best for: AI Engineer, Machine Learning Engineer, MLOps Engineer

Related on AIssential

See Counsel's argued verdicts on the open AI decisions leaders are weighing →

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by NVIDIA Technical Blog.