Remask, Don't Replace: Token-to-Mask Refinement in Masked Diffusion Language Models

2026-04-22 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Expert, extended

Summary

Token-to-Mask (T2M) remasking is a new inference-time method for masked diffusion language models such as LLaDA2.1. It targets structural failure modes in their Token-to-Token (T2T) editing phase. T2T editing overwrites a suspect token with a new guess when a confidence threshold is met, but it can fail through correction inertia, context pollution, and train-inference noise mismatch. Rather than overwriting, T2M resets the suspect token to the mask state so that the next denoising step re-predicts it from an in-distribution context. The method is training-free and introduces no new parameters. Tested across 8 benchmarks, it improved accuracy, most notably by +5.92 points on CMATH. On CMATH, 79.9% of baseline errors were identified as "last-mile corruption" (correct reasoning, garbled final answer), and T2M repaired 41.3% of these cases.
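
Read together, the two CMATH percentages imply a rough share of all baseline errors that T2M repairs. The short calculation below assumes the 41.3% repair rate is measured over the last-mile cases only, as the sentence above suggests; the variable names are illustrative and do not come from the paper.

```python
# Figures quoted in the summary for CMATH.
last_mile_share = 0.799  # fraction of baseline errors that are last-mile corruption
repair_rate = 0.413      # fraction of those cases repaired by T2M (assumed denominator)

# Share of all baseline errors repaired via last-mile fixes.
repaired_share = last_mile_share * repair_rate
print(f"{repaired_share:.1%}")  # 33.0%
```

Under that reading, about a third of baseline CMATH errors are repaired through last-mile fixes alone.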

Key takeaway

For AI engineers and research scientists working with masked diffusion language models, T2M remasking can improve accuracy, especially on tasks that demand exact token-level output. Because it is a training-free, inference-time modification, consider integrating it to improve error correction, particularly for "last-mile corruption": it resets problematic tokens to a neutral mask state so that they can be re-predicted.

Key insights

Remasking erroneous tokens to [M] improves masked diffusion language model accuracy by providing a neutral context for re-prediction.

Principles

A mask is a superior conditioning signal to an erroneous token.
Decouple error detection from correction in generative models.
Inference-time errors differ from training noise distributions.

Method

T2M remasking replaces a suspect token with the mask token [M] instead of a new guess, so the model re-predicts that position from an in-distribution context during the next denoising step. Suspect tokens are identified by a detection heuristic such as LowProb, T2T-Remask, or LogitDiff.
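
The contrast between overwriting and remasking can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, tokens are plain strings rather than model token ids, and LowProb is assumed to flag tokens whose probability falls below a threshold.

```python
from typing import List, Sequence

MASK = "[M]"  # mask symbol as written in the text; real token ids are model-specific


def low_prob_flags(token_probs: Sequence[float], tau_lp: float = 0.3) -> List[bool]:
    """Detection (assumed form of LowProb): flag tokens with probability below tau_lp."""
    return [p < tau_lp for p in token_probs]


def t2t_edit(tokens: Sequence[str], flags: Sequence[bool], guesses: Sequence[str]) -> List[str]:
    """Token-to-Token: overwrite each flagged token with the model's new guess."""
    return [g if f else t for t, f, g in zip(tokens, flags, guesses)]


def t2m_remask(tokens: Sequence[str], flags: Sequence[bool]) -> List[str]:
    """Token-to-Mask: reset each flagged token to the mask state.
    Correction is left to the next denoising step."""
    return [MASK if f else t for t, f in zip(tokens, flags)]


tokens = ["x", "=", "4", "7"]
probs = [0.97, 0.99, 0.12, 0.91]  # made-up confidences for illustration
flags = low_prob_flags(probs)
print(t2m_remask(tokens, flags))  # ['x', '=', '[M]', '7']
```

Keeping detection (low_prob_flags) separate from correction makes it straightforward to swap in a different heuristic without touching the remasking step.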

In practice

Implement T2M with LowProb at $\tau_{\text{lp}} = 0.3$ for optimal accuracy.
Use safety caps ($C_{\max} = 1$, $\rho_{\max} = 0.25$) to prevent remasking oscillations.
Prioritize T2M for tasks requiring exact token-level output.
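
One way the threshold and safety caps listed above might combine is sketched here. The summary does not define the caps, so this code assumes that C_max limits how many times a single position may be remasked and that rho_max limits the fraction of the sequence remasked in one step. The function name and the lowest-probability-first selection rule are also assumptions made for illustration.

```python
from typing import List, Sequence


def capped_remask_positions(
    token_probs: Sequence[float],
    remask_counts: List[int],
    tau_lp: float = 0.3,
    c_max: int = 1,
    rho_max: float = 0.25,
) -> List[int]:
    """Pick positions to remask this step and update remask_counts in place."""
    # Assumed meaning of rho_max: at most this fraction of positions per step.
    budget = int(rho_max * len(token_probs))
    # Assumed meaning of C_max: a position is remasked at most c_max times.
    candidates = [
        i for i, p in enumerate(token_probs)
        if p < tau_lp and remask_counts[i] < c_max
    ]
    # Spend the budget on the least confident tokens first (illustrative choice).
    candidates.sort(key=lambda i: token_probs[i])
    chosen = sorted(candidates[:budget])
    for i in chosen:
        remask_counts[i] += 1
    return chosen


probs = [0.9, 0.1, 0.2, 0.05, 0.95, 0.25, 0.8, 0.99]  # made-up confidences
counts = [0] * len(probs)
print(capped_remask_positions(probs, counts))  # [1, 3]
```

With these assumed semantics, a position that has already been remasked once is never selected again, which is what stops a token from flipping between masked and unmasked states indefinitely.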

Topics

Masked Diffusion Language Models
Token-to-Token Editing
Token-to-Mask Remasking
Inference-time Self-Correction
Last-mile Corruption

Code references

synsis/remasked_DLM

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.