SPG: Sandwiched Policy Gradient for Masked Diffusion Language Models

· Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Mathematics & Computational Sciences · Depth: Expert, extended

Summary

The Sandwiched Policy Gradient (SPG) is a novel reinforcement learning algorithm designed to improve the alignment of diffusion large language models (dLLMs) with human preferences or task-specific rewards. dLLMs offer efficient parallel token decoding, but their intractable log-likelihood makes standard policy gradient methods challenging. Existing approaches use one-sided approximations like the Evidence Lower Bound (ELBO), which can introduce bias. SPG addresses this by leveraging both an upper and a lower bound of the true log-likelihood: maximizing the ELBO for positive-reward sequences and minimizing an Evidence Upper Bound (EUBO) for negative-reward ones. This method, combined with a block-wise masking strategy for Monte Carlo estimation, significantly reduces policy gradient bias and improves optimization stability. Experiments show SPG outperforms state-of-the-art RL methods for dLLMs, achieving accuracy improvements of 3.6% on GSM8K, 2.6% on MATH500, 18.4% on Countdown, and 27.0% on Sudoku benchmarks.

Key takeaway

Research Scientists working with diffusion large language models (dLLMs) for fine-tuning should adopt the Sandwiched Policy Gradient (SPG) to overcome limitations of biased policy gradients. By incorporating both upper and lower bounds of the log-likelihood, SPG offers a more robust and stable approach to aligning dLLMs with specific rewards, leading to significant performance gains on reasoning tasks. You should explore the provided codebase to integrate SPG, particularly its block-wise masking strategy, into your dLLM training pipelines.

Key insights

SPG uses both upper and lower bounds of intractable log-likelihoods to reduce bias in dLLM reinforcement learning.

Principles

Method

SPG maximizes ELBO for positive rewards and minimizes EUBO for negative rewards, using a block-wise masking strategy for Monte Carlo estimation to stabilize training and reduce gradient variance.

In practice

Topics

Code references

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.