$R^2$-dLLM: Accelerating Diffusion Large Language Models via Spatio-Temporal Redundancy Reduction

2026-04-22 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, quick

Summary

R2-dLLM is a new framework designed to accelerate Diffusion Large Language Models (dLLMs) by addressing spatio-temporal redundancy during the decoding process. dLLMs offer parallel token prediction, but their inference latency remains a significant deployment bottleneck. The R2-dLLM framework identifies two primary sources of inefficiency: spatial redundancy from confidence clusters and positional ambiguity, and temporal redundancy from repeatedly remasking stable predictions. It introduces training-free decoding rules that aggregate local confidence and token predictions, along with a mechanism to finalize stable tokens, thereby avoiding redundant decoding steps. Additionally, R2-dLLM incorporates a redundancy-aware supervised fine-tuning pipeline to align the model with efficient decoding trajectories. Experiments show that R2-dLLM reduces decoding steps by up to 75% compared to existing strategies while maintaining competitive generation quality across various models and tasks.

Key takeaway

For AI engineers deploying Diffusion Large Language Models, R2-dLLM targets inference latency directly. Its redundancy-reduction techniques cut decoding steps by up to 75% compared to existing decoding strategies while keeping generation quality competitive. Fewer decoding steps mean more efficient resource use and faster responses, which makes dLLMs more viable for real-time applications and broader deployment.

Key insights

dLLM decoding is slowed by two kinds of redundancy: spatial (confidence clusters and positional ambiguity) and temporal (repeated remasking of predictions that are already stable).

Principles

Aggregate local confidence and token predictions.
Finalize temporally stable tokens early.
Align models with efficient decoding trajectories.
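
The first principle can be illustrated with a minimal sketch. The summary does not give the actual aggregation rule, so everything here is an assumption made for illustration: the windowed mean, the `radius` and `threshold` parameters, and the function names are hypothetical. The sketch covers confidence aggregation only, not the aggregation of token predictions.

```python
from typing import List


def aggregate_local_confidence(conf: List[float], radius: int = 1) -> List[float]:
    """Mean confidence over a window of +/- radius positions (clipped at the edges).

    Assumption: a plain windowed mean stands in for the paper's rule.
    """
    n = len(conf)
    out = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        window = conf[lo:hi]
        out.append(sum(window) / len(window))
    return out


def select_positions(conf: List[float], masked: List[int],
                     threshold: float = 0.8, radius: int = 1) -> List[int]:
    """Return masked positions whose aggregated confidence reaches the threshold."""
    agg = aggregate_local_confidence(conf, radius)
    chosen = [i for i in masked if agg[i] >= threshold]
    if not chosen and masked:
        # Fall back to the single best masked position so decoding always progresses.
        chosen = [max(masked, key=lambda i: agg[i])]
    return chosen
```

In this sketch a position is unmasked only when its neighbourhood is confident as well, so an isolated high score next to uncertain positions is held back for a later step.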

Method

R2-dLLM employs training-free decoding rules to aggregate local confidence and finalize stable tokens, complemented by a redundancy-aware supervised fine-tuning pipeline to optimize decoding trajectories.
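
The stable-token mechanism can be sketched as a streak counter. This is an illustration under stated assumptions, not the paper's rule: the `patience` criterion (a token is finalized once its prediction has stayed unchanged for that many consecutive steps) and all names below are hypothetical.

```python
from typing import Dict, List, Set


def update_stability(prev_pred: Dict[int, str], curr_pred: Dict[int, str],
                     streak: Dict[int, int], finalized: Set[int],
                     patience: int = 2) -> List[int]:
    """Update per-position streaks and return the positions finalized at this step."""
    newly = []
    for pos, tok in curr_pred.items():
        if pos in finalized:
            continue  # finalized tokens are never remasked or re-evaluated
        if prev_pred.get(pos) == tok:
            streak[pos] = streak.get(pos, 0) + 1
        else:
            streak[pos] = 0  # the prediction changed, so the streak restarts
        if streak[pos] >= patience:
            finalized.add(pos)
            newly.append(pos)
    return newly


def finalization_steps(history: List[Dict[int, str]], patience: int = 2) -> Dict[int, int]:
    """Map each position to the step index at which it was finalized."""
    prev: Dict[int, str] = {}
    streak: Dict[int, int] = {}
    finalized: Set[int] = set()
    result: Dict[int, int] = {}
    for step, curr in enumerate(history):
        for pos in update_stability(prev, curr, streak, finalized, patience):
            result[pos] = step
        prev = curr
    return result
```

Once a position is in `finalized`, later steps skip it, which is where the saving in redundant decoding work would come from in this simplified picture.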

In practice

Reduce dLLM decoding steps by up to 75%.
Maintain generation quality with faster inference.
Apply to various dLLM models and tasks.
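
To make the headline number concrete, the arithmetic behind a step reduction can be written out. The baseline of 256 steps in the usage note is a hypothetical figure, not one from the paper, and the helper names are illustrative. The result is a ratio of step counts only: wall-clock speedup also depends on the cost of each step, which the summary does not report.

```python
import math


def steps_after_reduction(baseline_steps: int, reduction: float) -> int:
    """Decoding steps left after removing a fraction `reduction` of the baseline."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return math.ceil(baseline_steps * (1.0 - reduction))


def step_speedup(reduction: float) -> float:
    """Ratio of baseline steps to remaining steps for a given reduction."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)
```

With a hypothetical baseline of 256 steps, `steps_after_reduction(256, 0.75)` leaves 64 steps, and `step_speedup(0.75)` gives a 4x ratio in step count.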

Topics

Diffusion Large Language Models
Inference Latency
Spatio-Temporal Redundancy
Decoding Acceleration
Supervised Fine-tuning

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.