Reinforcement Learning from Denoising Feedback

· Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, extended

Summary

Reinforcement Learning from Denoising Feedback (RLDF) is a training paradigm for diffusion language models (dLLMs) that addresses the problem of estimating the policy loss. RLDF improves dLLM reasoning by using feedback from both rollout and training: it optimizes the model toward a clipped clean state x_0 from intermediate noisy states x_t, selecting timesteps by weighted sampling. The approach achieves a favorable precision-efficiency trade-off. In experiments, RLDF delivers consistent gains of up to 10 accuracy points for Dream on MATH500 and MBPP, and approximately 8 points for LLaDA on the same benchmarks. The gains hold across both the LLaDA and Dream architectures on math and code reasoning tasks, and the associated Drift training framework is open-sourced.

Key takeaway

For AI scientists and ML engineers developing diffusion language models, RLDF offers a way to improve reasoning capability and training stability. The two techniques to integrate are weighted timestep sampling and x_0 estimation, which together make policy loss estimation both more efficient and more precise. The reported gains reach up to 10 accuracy points on math and code benchmarks, making RLDF a candidate framework for scalable dLLM reinforcement learning.

Key insights

RLDF improves dLLM reasoning by estimating the policy loss efficiently: it samples denoising feedback with weights and predicts the clean state x_0 at the sampled steps.
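
To make the weighted-sampling idea concrete, here is a minimal Python sketch. It is not taken from the paper or from the Drift framework. It assumes that "token uncertainty" is measured as the mean Shannon entropy of the per-token predictive distributions at each denoising step, and that steps are drawn with replacement in proportion to that score. The function names and the data layout are hypothetical.

```python
import math
import random


def token_entropy(probs):
    """Shannon entropy (in nats) of one token's predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def timestep_weights(step_token_probs):
    """Turn the mean token entropy of each denoising step into sampling weights.

    step_token_probs[t] is a list of per-token probability distributions
    observed at denoising step t (hypothetical layout, assumed for this sketch).
    """
    scores = []
    for token_probs in step_token_probs:
        entropies = [token_entropy(p) for p in token_probs]
        scores.append(sum(entropies) / len(entropies))
    total = sum(scores)
    if total == 0.0:
        # Every step is fully confident: fall back to uniform weights.
        return [1.0 / len(scores)] * len(scores)
    return [s / total for s in scores]


def sample_timesteps(weights, k, rng):
    """Draw k step indices with replacement, proportional to the weights."""
    return rng.choices(range(len(weights)), weights=weights, k=k)
```

Under these assumptions, steps where the model is still undecided about many tokens are visited more often, while fully confident steps receive little or no training budget.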

Method

RLDF collects feedback from rollout and training, then selects denoising steps by weighted sampling based on token uncertainty. The policy loss is computed on the clipped x_0 at the selected steps and aggregated together with KL regularization.
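
The loss aggregation described above can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation. It reads "clipped" as a PPO/GRPO-style clipped importance ratio on the log-probabilities of the predicted x_0 tokens, which the summary does not confirm, and it uses the non-negative estimator r - log r - 1 for the KL term. The names `rldf_style_loss`, `beta`, and `eps`, and their default values, are hypothetical.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantage, eps=0.2):
    """PPO-style clipped objective for one token (a quantity to maximize)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


def kl_estimate(logp_new, logp_ref):
    """Non-negative per-token KL estimate: r - log r - 1, with r = p_ref / p_new."""
    log_r = logp_ref - logp_new
    return math.exp(log_r) - log_r - 1.0


def rldf_style_loss(steps, advantage, beta=0.04, eps=0.2):
    """Average the per-token loss over the sampled denoising steps.

    steps: one dict per sampled step, each holding equal-length lists
    'new', 'old', and 'ref' of log-probs for the x_0 tokens predicted
    from x_t (assumed layout). beta weights the KL regularizer.
    """
    step_losses = []
    for s in steps:
        token_losses = [
            -clipped_surrogate(n, o, advantage, eps) + beta * kl_estimate(n, r)
            for n, o, r in zip(s["new"], s["old"], s["ref"])
        ]
        step_losses.append(sum(token_losses) / len(token_losses))
    return sum(step_losses) / len(step_losses)
```

In this sketch each sampled step contributes equally to the final loss. A real implementation might instead reweight steps to correct for the non-uniform sampling, a design choice the summary leaves open.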

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.