Read the Trace, Steer the Path: Trajectory-Aware Reinforcement Learning for Diffusion Language Models
Summary
CAPR (Cached-Amortized Path Refinement) is a reinforcement learning algorithm for diffusion large language models (dLLMs), which generate responses by iteratively unmasking tokens. Existing RL approaches for dLLMs rely either on flat rollouts with sparse rewards or on compute-intensive tree rollouts. CAPR summarizes the dLLM's denoising trace into a compact path state, reuses cached trajectory states to generate cheap sibling continuations, and trains a block-level value head for local, block-wise supervision. The method redistributes the final outcome reward across blocks, converting a sparse reward into block-level PPO weights. CAPR achieves tree-like supervision without full tree expansion, reducing rollout-generation cost to approximately 0.75x that of flat rollouts and 0.6x that of tree rollouts. It sets a new state of the art for RL-tuned dLLMs on benchmarks including Sudoku, Countdown, GSM8K, and Math500, and matches strong tree-structured baselines at less than one third of the per-step compute on Sudoku.
Key takeaway
For machine learning engineers optimizing diffusion language models, CAPR offers an efficiency gain: it provides tree-like, block-level supervision from denoising traces while reducing rollout-generation cost to approximately 0.75x that of flat rollouts. Consider integrating CAPR's block-level value head into dLLM training, especially when working with 256- and 512-token budgets on LLaDA backbones.
Key insights
CAPR leverages dLLM denoising traces for tree-like supervision with reduced computational cost.
Principles
- Denoising traces offer rich, underutilized signals for dLLM training.
- Block-level value heads can convert sparse rewards into granular PPO weights.
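One hypothetical way to turn block-level value predictions into per-block weights is a temporal-difference residual taken over blocks. The sketch below is an assumption for illustration, not CAPR's published formula: `block_advantages` and its arguments are made-up names, and the value predictions are passed in as plain floats rather than produced by a trained value head.

```python
from typing import List


def block_advantages(values: List[float], final_reward: float) -> List[float]:
    """Per-block advantages from block-level value predictions (illustrative).

    values[b] is the predicted return just before block b is denoised.
    The advantage of block b is the value of the next state minus values[b];
    for the last block, the next-state value is the observed final reward.
    """
    if not values:
        return []
    # Shift the predictions by one block and append the true outcome reward.
    targets = values[1:] + [final_reward]
    return [target - value for target, value in zip(targets, values)]


# Hypothetical predictions for a three-block response that ends with reward 1.0.
advantages = block_advantages([0.2, 0.5, 0.4], final_reward=1.0)
```

Under this assumed scheme the advantages telescope: they sum to `final_reward - values[0]`, so the sparse outcome signal is spread over blocks without changing its total. Each block's advantage could then serve as the weight on that block's tokens in a PPO-style objective.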
Method
CAPR summarizes the denoising trace into a path state, uses cached trajectory states for sibling continuations, and trains a block-level value head to redistribute final rewards across blocks according to revealed tokens.
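As a rough illustration of redistributing a final reward "according to revealed tokens", the sketch below splits the outcome reward in proportion to how many tokens each block unmasked. Proportional splitting is an assumption made here for clarity; the summary does not state the exact rule, and `redistribute_reward` is a hypothetical helper name.

```python
from typing import List


def redistribute_reward(final_reward: float, revealed_per_block: List[int]) -> List[float]:
    """Split one outcome reward across blocks by revealed-token share (illustrative)."""
    n_blocks = len(revealed_per_block)
    if n_blocks == 0:
        return []
    total_revealed = sum(revealed_per_block)
    if total_revealed == 0:
        # Nothing was revealed anywhere: fall back to an even split.
        return [final_reward / n_blocks] * n_blocks
    return [final_reward * count / total_revealed for count in revealed_per_block]


# Three blocks that revealed 2, 6, and 8 tokens; the rollout earned reward 1.0.
block_rewards = redistribute_reward(1.0, [2, 6, 8])
```

With these inputs the blocks receive 0.125, 0.375, and 0.5 of the reward, which sum back to the original 1.0.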
In practice
- Utilize denoising traces for finer-grained dLLM supervision.
- Implement block-level value heads for reward redistribution.
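The idea of reusing cached trajectory states for sibling continuations can be sketched with a toy denoiser. Everything below is a stand-in written under loud assumptions: `denoise_block` fills masked positions with random digits instead of calling a dLLM, blocks are unmasked strictly left to right, and all function names are hypothetical.

```python
import random
from typing import List, Optional, Tuple

MASK = None  # marker for a still-masked position
Sequence = List[Optional[int]]


def denoise_block(seq: Sequence, start: int, end: int, rng: random.Random) -> Sequence:
    """Toy stand-in for a dLLM step: fill masked positions in [start, end)."""
    out = list(seq)
    for i in range(start, end):
        if out[i] is MASK:
            out[i] = rng.randint(0, 9)
    return out


def rollout_with_cache(length: int, block_size: int, rng: random.Random) -> Tuple[Sequence, List[Sequence]]:
    """Run one full rollout, snapshotting the state before each block."""
    seq: Sequence = [MASK] * length
    cache: List[Sequence] = []
    for start in range(0, length, block_size):
        cache.append(list(seq))  # state just before this block is denoised
        seq = denoise_block(seq, start, min(start + block_size, length), rng)
    return seq, cache


def sibling_from_cache(cache: List[Sequence], block_index: int, block_size: int, rng: random.Random) -> Sequence:
    """Resume from a cached snapshot so earlier blocks are not recomputed."""
    seq = list(cache[block_index])
    length = len(seq)
    for start in range(block_index * block_size, length, block_size):
        seq = denoise_block(seq, start, min(start + block_size, length), rng)
    return seq


base, snapshots = rollout_with_cache(length=12, block_size=4, rng=random.Random(0))
sibling = sibling_from_cache(snapshots, block_index=1, block_size=4, rng=random.Random(1))
```

In this toy setup the sibling shares the first block with the base rollout and only pays for the remaining two blocks, which is the kind of saving a cached-state scheme is after.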
Topics
- Diffusion Language Models
- Reinforcement Learning
- Trajectory-Aware RL
- CAPR Algorithm
- Compute Efficiency
- LLaDA Backbones
- GSM8K
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.