Reinforcement Learning from Denoising Feedback
Summary
Reinforcement Learning from Denoising Feedback (RLDF) is a training paradigm for estimating the policy loss in diffusion language models (dLLMs). RLDF improves dLLM reasoning by using feedback from both rollout and training: it optimizes the model toward a clipped clean state x_0 from intermediate noisy states x_t, with the timesteps chosen by weighted sampling. This gives a favorable precision-efficiency trade-off. Experiments show consistent and substantial gains: up to 10 accuracy points for Dream on MATH500 and MBPP, and approximately 8 points for LLaDA on the same benchmarks. The method generalizes across the LLaDA and Dream architectures on math and code reasoning tasks, and the associated Drift training framework is open-sourced.
Key takeaway
For AI scientists and ML engineers developing diffusion language models, RLDF is a way to improve reasoning capabilities and training stability. Teams should consider integrating weighted timestep sampling and x_0 estimation, since these techniques make policy loss estimation both more efficient and more precise. The approach yields substantial gains on math and code benchmarks and offers a framework for scalable dLLM reinforcement learning.
Key insights
RLDF enhances dLLM reasoning by efficiently estimating policy loss via weighted sampling of denoising feedback and clean state prediction.
Principles
- Prioritize low-confidence denoising steps for gradient signal.
- Predict the clean state x_0 for stable RL training.
- Token-level clipping improves signal-to-noise ratio.
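To make the clean-state principle concrete, here is a minimal sketch, assuming a masked-diffusion setup in which x_t is a token sequence with some positions still masked. The `MASK` token, the `predict_x0` helper, and the dictionary-based token distributions are hypothetical names chosen for illustration; they are not taken from the paper or from the Drift framework.

```python
MASK = "<mask>"  # hypothetical mask token, for illustration only


def predict_x0(x_t, token_dists):
    """Predict a clean sequence x_0 from a noisy sequence x_t.

    x_t         : list of tokens, some equal to MASK
    token_dists : token_dists[i] maps candidate token -> probability
                  for position i (only read at masked positions)

    Every masked position is filled with its most likely token in one
    shot. Returns the predicted x_0 and the probability assigned to
    each token (1.0 for positions that were already unmasked).
    """
    x0, probs = [], []
    for tok, dist in zip(x_t, token_dists):
        if tok == MASK:
            best = max(dist, key=dist.get)  # argmax over candidates
            x0.append(best)
            probs.append(dist[best])
        else:
            x0.append(tok)
            probs.append(1.0)
    return x0, probs


if __name__ == "__main__":
    x_t = ["2", MASK, "2", "=", MASK]
    dists = [{}, {"+": 0.7, "-": 0.3}, {}, {}, {"4": 0.6, "5": 0.4}]
    print(predict_x0(x_t, dists))
```

The per-token probabilities returned alongside x_0 are what a token-level filter could later use to decide which predictions are reliable enough to train on.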
Method
RLDF collects rollout and training feedback, then uses weighted sampling based on token uncertainty to select denoising steps. The policy loss is computed on the clipped x_0 for these steps and combined with KL regularization.
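The step-selection part of this method can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each denoising step has a scalar confidence in [0, 1] and takes uncertainty to be `1 - confidence`, so low-confidence steps are sampled more often. The function names `step_weights` and `sample_steps` and the `eps` smoothing term are assumptions introduced here.

```python
import random


def step_weights(step_confidences, eps=1e-6):
    """Turn per-step confidences in [0, 1] into sampling weights.

    Assumption: uncertainty = 1 - confidence, so low-confidence steps
    get larger weight. The weighting used in the paper may differ.
    """
    raw = [max(1.0 - c, 0.0) + eps for c in step_confidences]
    total = sum(raw)
    return [r / total for r in raw]


def sample_steps(step_confidences, k, rng=None):
    """Sample up to k distinct denoising-step indices, weighted by uncertainty."""
    if rng is None:
        rng = random.Random()
    weights = step_weights(step_confidences)
    candidates = list(range(len(weights)))
    chosen = []
    for _ in range(min(k, len(candidates))):
        # Draw one index from the remaining candidates, then remove it
        # so that the same step is never selected twice.
        w = [weights[i] for i in candidates]
        pick = rng.choices(candidates, weights=w, k=1)[0]
        chosen.append(pick)
        candidates.remove(pick)
    return sorted(chosen)


if __name__ == "__main__":
    confidences = [0.95, 0.40, 0.10, 0.80]
    print(step_weights(confidences))
    print(sample_steps(confidences, k=2, rng=random.Random(0)))
```

Sampling only a few steps per rollout, rather than computing the loss at every step, is where the efficiency side of the trade-off would come from in a setup like this.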
In practice
- Implement weighted timestep sampling for dLLM RL.
- Use x_0 estimation rather than x_{t-1} prediction for stability.
- Apply token clipping to filter low-probability tokens.
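The loss side of these recommendations can be sketched as below, under stated assumptions. Token clipping is interpreted here as masking out predicted x_0 tokens whose probability falls below a threshold. The surrogate is a plain importance-weighted policy-gradient term, and the KL penalty uses a common non-negative per-token estimator, `exp(d) - d - 1` with `d = logp_ref - logp_new`. The threshold `min_prob`, the coefficient `beta`, and both function names are hypothetical. The code computes scalar values only and involves no autograd.

```python
import math


def clip_mask(token_probs, min_prob=0.1):
    """Return 1 for predicted x_0 tokens to keep, 0 for filtered tokens."""
    return [1 if p >= min_prob else 0 for p in token_probs]


def masked_policy_loss(logp_new, logp_old, logp_ref, advantage, mask, beta=0.04):
    """Policy loss over kept tokens plus a KL penalty to a reference model.

    logp_new / logp_old / logp_ref : per-token log-probabilities under the
        current, rollout, and reference policies
    advantage : scalar advantage for the whole sequence
    mask      : output of clip_mask; tokens with 0 are ignored
    """
    kept = [i for i, m in enumerate(mask) if m]
    if not kept:
        return 0.0  # nothing reliable to train on at this step
    pg = 0.0
    kl = 0.0
    for i in kept:
        ratio = math.exp(logp_new[i] - logp_old[i])  # importance ratio
        pg += ratio * advantage
        d = logp_ref[i] - logp_new[i]
        kl += math.exp(d) - d - 1.0  # non-negative KL estimate
    n = len(kept)
    return -(pg / n) + beta * (kl / n)


if __name__ == "__main__":
    probs = [0.9, 0.05, 0.5]
    mask = clip_mask(probs)
    logp = [math.log(p) for p in probs]
    print(mask, masked_policy_loss(logp, logp, logp, advantage=1.0, mask=mask))
```

When the current, rollout, and reference policies agree, the ratio is 1 and the KL term vanishes, so the loss reduces to the negative advantage. That makes a convenient sanity check for any real implementation.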
Topics
- Diffusion Language Models
- Reinforcement Learning
- Policy Gradient Estimation
- Reasoning Benchmarks
- Weighted Sampling
- Drift Framework
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.