Self-Augmenting Retrieval for Diffusion Language Models

2026-06-06 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Expert, extended

Summary

Self-Augmenting Retrieval for Diffusion Language Models (SARDI) is a dynamic Retrieval-Augmented Generation (RAG) framework designed for discrete diffusion language models (DLMs). It exploits the iterative denoising trajectory of DLMs, using low-confidence, tentative tokens as a "lookahead" signal for retrieval. These speculative tokens surface salient entities early in generation, so stronger evidence can be retrieved before the final output is committed. SARDI is training-free, retriever-agnostic, and compatible with any reasoning-capable DLM, such as DREAM-7B. Across five multi-hop QA benchmarks, including 2WikiMultiHopQA, HotpotQA, and MuSiQue, SARDI significantly outperforms current training-free diffusion and autoregressive retrieval baselines while achieving up to 8x higher throughput. On 2WikiMultiHopQA, for instance, it raises Exact Match (EM) from 44% to 59%. The work also shows that RAG grounding substantially reduces inter-token dependence, which benefits parallel decoding.

Key takeaway

AI scientists and machine learning engineers evaluating non-autoregressive models for knowledge-intensive tasks should consider pairing discrete diffusion language models (DLMs) with a dynamic retrieval framework such as SARDI. SARDI exploits the DLM denoising trajectory to surface evidence early, which is particularly helpful for multi-hop reasoning. The reported results show up to 8x higher throughput and higher accuracy than training-free diffusion and autoregressive retrieval baselines.

Key insights

SARDI uses a DLM's tentative tokens as lookahead signals for dynamic retrieval, improving both RAG accuracy and throughput.

Principles

Diffusion trajectories provide a valuable lookahead for retrieval.
Low-confidence tokens can effectively guide dynamic retrieval.
RAG grounding reduces inter-token dependence, which makes parallel decoding more effective.

Method

SARDI interleaves retrieval with denoising. At each step it constructs a query from the partially denoised sequence, using tentative tokens whose confidence exceeds a query threshold (τq), retrieves fresh evidence, and conditions the next denoising step on the updated context.
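
The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code from the SARDI repository: `Proposal`, `build_query`, `generate`, and the `denoise_step` and `retrieve` callables are hypothetical names, and the rule for fixing tokens (commit any position whose confidence reaches a commit threshold τc, otherwise commit the single most confident one so the loop always progresses) is an assumption made for the sketch.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Proposal:
    token: str         # tentative token predicted for one position
    confidence: float  # model confidence in [0, 1]


# Hypothetical interfaces: a real DLM and retriever would sit behind these.
DenoiseStep = Callable[[str, List[str], List[Optional[str]]], List[Proposal]]
Retriever = Callable[[str], List[str]]


def build_query(question: str, committed: List[Optional[str]],
                proposals: List[Proposal], tau_q: float) -> str:
    """Question + committed tokens + tentative tokens with confidence > tau_q."""
    parts = [question]
    for fixed, prop in zip(committed, proposals):
        if fixed is not None:
            parts.append(fixed)          # already committed
        elif prop.confidence > tau_q:
            parts.append(prop.token)     # speculative lookahead token
    return " ".join(parts)


def generate(question: str, length: int, denoise_step: DenoiseStep,
             retrieve: Retriever, tau_q: float = 0.0, tau_c: float = 0.9,
             max_steps: int = 32) -> List[str]:
    committed: List[Optional[str]] = [None] * length  # None = still masked
    context = retrieve(question)                      # initial evidence
    for _ in range(max_steps):
        masked = [i for i, tok in enumerate(committed) if tok is None]
        if not masked:
            break
        # One proposal per position, conditioned on the current evidence.
        proposals = denoise_step(question, context, committed)
        # Retrieval uses the tentative tokens; the next step sees the result.
        context = retrieve(build_query(question, committed, proposals, tau_q))
        # Assumed commit rule: confident positions, else the single best one.
        ready = [i for i in masked if proposals[i].confidence >= tau_c]
        if not ready:
            ready = [max(masked, key=lambda i: proposals[i].confidence)]
        for i in ready:
            committed[i] = proposals[i].token
    return [tok if tok is not None else "<mask>" for tok in committed]
```

With τq at 0, every tentative token with nonzero confidence enters the query, so an entity the model only weakly suspects can already pull in supporting documents for the following step.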

In practice

Set the query threshold (τq) near 0 so that even low-confidence tentative tokens enter the query, maximizing the lookahead benefit.
Adjust the commit threshold (τc) to balance accuracy and throughput.
Consider document-level KV caching to amortize retrieval costs.
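
The caching suggestion can be illustrated with a small sketch. Because retrieval runs at many denoising steps, the same documents tend to come back repeatedly, and their encoded states can be reused instead of recomputed. The class below is a hypothetical illustration, not SARDI's implementation: `DocumentKVCache` and the `encode` callable are assumed names, and the sketch assumes each document can be encoded independently of the documents around it.

```python
from typing import Callable, Dict, Generic, List, TypeVar

KV = TypeVar("KV")  # whatever per-document state the model produces


class DocumentKVCache(Generic[KV]):
    """Reuse per-document encoder state across retrieval rounds."""

    def __init__(self, encode: Callable[[str], KV]) -> None:
        self._encode = encode          # hypothetical: text -> cached state
        self._store: Dict[str, KV] = {}
        self.hits = 0
        self.misses = 0

    def get(self, doc_id: str, text: str) -> KV:
        if doc_id in self._store:
            self.hits += 1             # document seen at an earlier step
        else:
            self.misses += 1           # first sighting: pay the encoding cost
            self._store[doc_id] = self._encode(text)
        return self._store[doc_id]

    def get_many(self, docs: Dict[str, str]) -> List[KV]:
        """Return cached states for a retrieved {doc_id: text} batch, in order."""
        return [self.get(doc_id, text) for doc_id, text in docs.items()]
```

The `hits` and `misses` counters give a quick way to check how much of the retrieval cost is actually being amortized for a given workload.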

Topics

Diffusion Language Models
Retrieval-Augmented Generation
Dynamic Retrieval
Multi-hop QA
Parallel Decoding
DREAM-7B

Code references

pauljngr/SARDI

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.