DACA-GRPO: Denoising-Aware Credit Assignment for Reinforcement Learning in Diffusion Language Models

Source: cs.LG updates on arXiv.org · Field: Technology & Digital, Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

DACA-GRPO (Denoising-Aware Credit Assignment for GRPO) is a plug-and-play enhancement for Group Relative Policy Optimization (GRPO)-style trainers for diffusion large language models (dLLMs). Released on May 8, 2026, it addresses two weaknesses in existing dLLM reinforcement learning methods: the lack of temporal credit assignment across denoising steps, and biased, high-variance likelihood estimates. It introduces two components. Denoising Progress Scores (DPS) extract per-token importance weights from intermediate predictions at no additional forward-pass cost. Stratified Masking Likelihood (SML) partitions token positions into strata, which gives the model more context when estimating likelihoods and so reduces mean-field bias. Evaluated on LLaDA-8B-Instruct across three GRPO base methods (Diffu-GRPO, wd1, GDPO) and seven benchmarks spanning mathematical reasoning, code generation, constraint satisfaction, and constrained generation, DACA-GRPO consistently improves performance, with gains of up to 36.3 percentage points (pp) on constraint satisfaction and 7.4 pp on code generation.
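The stratified-masking idea can be illustrated with a minimal sketch. Everything here is an assumption made for illustration rather than the paper's implementation: the names `stratify_positions`, `stratified_log_likelihood`, and `toy_log_prob_fn` are hypothetical, the strata are drawn by a random round-robin split, and the toy scorer simply becomes more confident as more tokens are visible. The sketch only shows the mechanics: mask one stratum at a time, score its positions given everything else, and sum the results.

```python
import math
import random
from typing import Callable, Dict, List, Sequence

# Hypothetical scorer signature: given the full token sequence and the positions
# to hide, return log p(token | visible context) for every hidden position.
LogProbFn = Callable[[Sequence[int], List[int]], Dict[int, float]]


def stratify_positions(num_tokens: int, num_strata: int, seed: int = 0) -> List[List[int]]:
    """Randomly split positions 0..num_tokens-1 into disjoint, near-equal strata."""
    positions = list(range(num_tokens))
    random.Random(seed).shuffle(positions)
    # Round-robin slicing keeps stratum sizes within one of each other.
    return [sorted(positions[k::num_strata]) for k in range(num_strata)]


def stratified_log_likelihood(
    tokens: Sequence[int], log_prob_fn: LogProbFn, num_strata: int, seed: int = 0
) -> float:
    """Sum per-token log-probs, masking one stratum at a time."""
    total = 0.0
    for stratum in stratify_positions(len(tokens), num_strata, seed):
        if not stratum:  # possible when num_strata > len(tokens)
            continue
        scores = log_prob_fn(tokens, stratum)
        total += sum(scores[pos] for pos in stratum)
    return total


def toy_log_prob_fn(tokens: Sequence[int], masked: List[int]) -> Dict[int, float]:
    """Toy stand-in for a dLLM (assumption): confidence grows with visible context."""
    visible_fraction = 1.0 - len(masked) / len(tokens)
    prob = 0.25 + 0.5 * visible_fraction
    return {pos: math.log(prob) for pos in masked}


tokens = list(range(8))
# One stratum masks every position at once (the fully factorized estimate).
one_shot = stratified_log_likelihood(tokens, toy_log_prob_fn, num_strata=1)
# Four strata hide two positions per pass, leaving six tokens visible each time.
stratified = stratified_log_likelihood(tokens, toy_log_prob_fn, num_strata=4)
print(one_shot, stratified)
```

Under this toy scorer the stratified estimate is higher simply because each pass sees more context. With a real model the trade-off is one forward pass per stratum, so the number of strata would need to be chosen against the compute budget.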

Key takeaway

For research scientists optimizing diffusion LLMs with GRPO-style methods, DACA-GRPO improves performance by addressing two weaknesses in the training signal: missing temporal credit assignment and biased likelihood estimates. Consider integrating DACA-GRPO, particularly DPS, into your pipelines: it consistently improves accuracy across diverse tasks and base methods with minimal overhead. Prioritize SML for long-form generation tasks, where mean-field bias is more pronounced; DPS is a universal add-on.

Key insights

DACA-GRPO improves diffusion LLM reinforcement learning by assigning temporal credit and reducing likelihood estimation bias.

Method

DACA-GRPO combines Denoising Progress Scores (DPS) for temporal credit assignment and Stratified Masking Likelihood (SML) for improved log-likelihood estimation, integrating them into GRPO-style dLLM trainers.
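A minimal sketch of how per-token weights could plug into a GRPO-style advantage follows. It rests on loudly stated assumptions: the source does not give the DPS formula, so `toy_progress_weights` is a hypothetical stand-in that up-weights tokens whose intermediate predictions disagreed with the final token for more denoising steps, and `per_token_advantages` is an illustrative way to apply such weights. Only the group-relative standardization is the standard GRPO recipe.

```python
import statistics
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """GRPO-style advantage: standardize each reward within its sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def toy_progress_weights(
    intermediate: Sequence[Sequence[int]], final: Sequence[int]
) -> List[float]:
    """Hypothetical per-token weights (NOT the paper's DPS formula).

    intermediate[s][pos] is the token predicted at position pos after denoising
    step s. These predictions are already produced while sampling, so reading
    them costs no extra forward pass. A position scores higher the more steps
    its prediction differed from the final token. Weights are normalized to
    mean 1 so the overall advantage scale is unchanged.
    """
    num_steps = len(intermediate)
    raw = []
    for pos, tok in enumerate(final):
        disagreements = sum(1 for step in intermediate if step[pos] != tok)
        raw.append(1.0 + disagreements / num_steps)
    mean_raw = sum(raw) / len(raw)
    return [w / mean_raw for w in raw]


def per_token_advantages(advantage: float, weights: Sequence[float]) -> List[float]:
    """Spread one sequence-level advantage over tokens according to the weights."""
    return [advantage * w for w in weights]


# A group of four sampled completions with binary rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# Denoising trajectory of the first completion: three steps, three positions.
final_tokens = [5, 6, 7]
intermediate_preds = [[0, 6, 0], [5, 6, 0], [5, 6, 7]]
weights = toy_progress_weights(intermediate_preds, final_tokens)
token_advantages = per_token_advantages(advantages[0], weights)
print(weights, token_advantages)
```

Normalizing the weights to mean 1 is a design choice in this sketch: it redistributes credit among tokens without changing the average magnitude of the update, which keeps the weighting compatible with whatever clipping or learning rate the base trainer already uses.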

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.LG updates on arXiv.org.