Reinforcement Learning from Denoising Feedback

2026-05-19 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, extended

Summary

Reinforcement Learning from Denoising Feedback (RLDF) is a training paradigm for diffusion language models (dLLMs) that targets the problem of policy loss estimation. It improves dLLM reasoning by using feedback from both rollout and training, and it optimizes the model toward a clipped clean state x_0 from intermediate noisy states x_t, with weighted timestep sampling selecting which denoising steps to train on. The authors report a favorable precision-efficiency trade-off. In their experiments, RLDF delivers consistent gains: up to 10 accuracy points for Dream on MATH500 and MBPP, and approximately 8 points for LLaDA on the same benchmarks. The method generalizes across the LLaDA and Dream architectures on math and code reasoning tasks, and the associated Drift training framework is open-sourced.

Key takeaway

For AI scientists and ML engineers developing diffusion language models, RLDF is worth adopting to improve reasoning capabilities and training stability. Teams should integrate weighted timestep sampling and x_0 estimation, which the authors credit with more efficient and more precise policy loss estimation. The approach yields substantial gains on math and code benchmarks and offers a framework for scalable dLLM reinforcement learning.

Key insights

RLDF enhances dLLM reasoning by efficiently estimating policy loss via weighted sampling of denoising feedback and clean state prediction.

Principles

Prioritize low-confidence denoising steps for gradient signal.
Predict clean state x_0 for stable RL training.
Token-level clipping improves signal-to-noise ratio.
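
The first principle can be illustrated with a small sampling sketch. It assumes each denoising step has a mean token confidence in [0, 1] and weights a step by one minus that confidence. The weighting form and the names `step_weights` and `sample_steps` are hypothetical, chosen for illustration; the summary does not specify how RLDF computes its weights.

```python
import random

def step_weights(step_confidences, eps=1e-6):
    """Map per-step mean token confidence to normalized sampling weights.

    Lower confidence gives a larger weight. The (1 - confidence) form is an
    assumption for illustration, not the weighting confirmed for RLDF.
    """
    raw = [max(1.0 - c, 0.0) + eps for c in step_confidences]
    total = sum(raw)
    return [w / total for w in raw]

def sample_steps(step_confidences, k, seed=0):
    """Draw k distinct denoising-step indices, favoring uncertain steps."""
    rng = random.Random(seed)
    weights = step_weights(step_confidences)
    remaining = list(range(len(step_confidences)))
    chosen = []
    for _ in range(min(k, len(remaining))):
        current = [weights[i] for i in remaining]
        pick = rng.choices(remaining, weights=current, k=1)[0]
        chosen.append(pick)
        remaining.remove(pick)  # sample without replacement
    return sorted(chosen)

# Hypothetical confidences for four denoising steps, early (noisy) to late.
confidences = [0.35, 0.60, 0.90, 0.99]
weights = step_weights(confidences)
selected = sample_steps(confidences, k=2)
```

Sampling only a few steps per rollout is what keeps the loss estimate cheap; biasing the draw toward uncertain steps is one plausible way to spend that budget where the gradient signal is expected to be largest.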

Method

RLDF collects rollout and training feedback, then uses weighted sampling based on token uncertainty to select denoising steps. Policy loss is computed on the clipped x_0 for these steps and aggregated with KL regularization.
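
A rough sketch of how such an aggregation could look follows. It makes two assumptions the summary does not confirm: the per-token objective is a PPO-style clipped importance ratio, and the KL term uses the non-negative estimator exp(d) - d - 1 with d = log p_ref - log p_new. The function names and the dictionary layout of `steps` are hypothetical.

```python
import math

def clipped_token_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO-style clipped surrogate for one x_0 token (assumed form)."""
    ratio = math.exp(logp_new - logp_old)
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped * advantage)

def kl_penalty(logp_new, logp_ref):
    """Non-negative per-token KL estimate against a reference policy."""
    d = logp_ref - logp_new
    return math.exp(d) - d - 1.0

def rl_loss(steps, advantage, kl_coef=0.01):
    """Weighted average over the sampled denoising steps.

    Each step holds a sampling weight and per-token log-probabilities of the
    predicted clean tokens under the new, old, and reference policies.
    """
    total, weight_sum = 0.0, 0.0
    for step in steps:
        n = len(step["logp_new"])
        objective = sum(
            clipped_token_objective(new, old, advantage)
            for new, old in zip(step["logp_new"], step["logp_old"])
        ) / n
        kl = sum(
            kl_penalty(new, ref)
            for new, ref in zip(step["logp_new"], step["logp_ref"])
        ) / n
        total += step["weight"] * (-objective + kl_coef * kl)
        weight_sum += step["weight"]
    return total / weight_sum

# Hypothetical data: two sampled steps where all three policies agree.
steps = [
    {"weight": 0.7, "logp_new": [-1.0, -2.0],
     "logp_old": [-1.0, -2.0], "logp_ref": [-1.0, -2.0]},
    {"weight": 0.3, "logp_new": [-0.5],
     "logp_old": [-0.5], "logp_ref": [-0.5]},
]
loss = rl_loss(steps, advantage=1.0)
```

When the new, old, and reference log-probabilities coincide, the ratio is 1 and the KL estimate is 0, so the loss reduces to the negated advantage. That gives a quick sanity check before plugging in real model outputs.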

In practice

Implement weighted timestep sampling for dLLM RL.
Use x_0 estimation rather than x_{t-1} for stability.
Apply token clipping to filter low-probability tokens.
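
The last two items can be sketched together. The example below assumes the model exposes a per-position probability distribution at a noisy step, takes the most likely token as the x_0 estimate, and drops positions whose top probability falls below a threshold. The threshold value, the helper names, and the toy distributions are all hypothetical; the summary does not say how RLDF defines its clipping rule.

```python
def predict_x0(token_probs):
    """Pick the most likely token per position as the clean-state estimate.

    token_probs: one dict per position mapping token -> probability.
    Returns the predicted tokens and their probabilities.
    """
    tokens, confidences = [], []
    for dist in token_probs:
        best = max(dist, key=dist.get)
        tokens.append(best)
        confidences.append(dist[best])
    return tokens, confidences

def clip_mask(confidences, min_prob=0.1):
    """Keep only positions predicted with at least min_prob (assumed rule)."""
    return [c >= min_prob for c in confidences]

def masked_mean(values, mask):
    """Average per-token values over the positions the mask keeps."""
    kept = [v for v, keep in zip(values, mask) if keep]
    return sum(kept) / len(kept) if kept else 0.0

# Toy distributions; only the top entries of each position are listed.
token_probs = [
    {"2": 0.90, "3": 0.10},
    {"+": 0.55, "-": 0.45},
    {"x": 0.06, "y": 0.05},
]
x0_tokens, x0_conf = predict_x0(token_probs)
mask = clip_mask(x0_conf, min_prob=0.1)

# Hypothetical per-token losses; the third position is filtered out.
per_token_loss = [0.1, 0.6, 3.0]
filtered_loss = masked_mean(per_token_loss, mask)
```

Filtering before averaging means a handful of near-random token predictions cannot dominate the step's loss, which is one reading of the signal-to-noise argument made for token-level clipping.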

Topics

Diffusion Language Models
Reinforcement Learning
Policy Gradient Estimation
Reasoning Benchmarks
Weighted Sampling
Drift Framework

Code references

ant-research/Drift

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.