DreamReasoner-8B: Block-Size Curriculum Learning for Diffusion Reasoning Models
Summary
DreamReasoner-8B is an open-source 8B-parameter block diffusion reasoning model designed to accelerate decoding for long chain-of-thought (CoT) reasoning. A systematic study by the authors found that training with large block sizes degrades reasoning performance, while training with small block sizes preserves it. To overcome this, the authors propose block-size curriculum learning, which gradually transitions training from fine-grained (small) to coarse-grained (large) block sizes. With this approach, DreamReasoner-8B achieves competitive results on mathematical and code reasoning benchmarks, matching leading autoregressive models such as Qwen3-8B. The work also introduces RelaxedConfidence, an analytical decoding probe that relaxes token commitment thresholds based on local context and yields average gains in TPF (tokens per forward pass) of 22.5% in the thinking phase and 54.5% in the answering phase.
Key takeaway
AI Scientists and Machine Learning Engineers developing efficient reasoning models should consider adopting block-size curriculum learning for diffusion language models. This approach stabilizes performance across varying inference block sizes, making diffusion models competitive with autoregressive models such as Qwen3-8B on reasoning while retaining the benefits of parallel decoding. The RelaxedConfidence decoding strategy is also worth exploring: it raises throughput, particularly in the answering phase, without compromising reasoning fidelity.
Key insights
Block-size curriculum learning enables diffusion models to achieve efficient, robust long-CoT reasoning competitive with autoregressive LLMs.
Principles
- Large training block sizes degrade reasoning in block diffusion models.
- Small training block sizes preserve reasoning capabilities.
- Gradual block-size increase during training improves robustness.
Method
Block-size curriculum learning starts training with small block sizes (e.g., 4), then gradually transitions to larger or mixed block sizes (e.g., 32 or {4,8,16,32}), so that the model acquires robust local causal dependencies and generalizes to larger inference blocks.
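The schedule described above can be sketched as a small helper that picks a block size for each training step. This is a minimal illustration, not the authors' implementation: the function names, the two-phase split, the `warmup_frac` value, and the uniform sampling over the mixed set are all assumptions made for the example. Only the block sizes 4 and {4,8,16,32} come from the summary.

```python
import random


def block_size_for_step(step, total_steps, warmup_frac=0.5,
                        small=4, mixed=(4, 8, 16, 32), rng=random):
    """Return a training block size for the given step (hypothetical schedule).

    Phase 1 (first `warmup_frac` of training): always use the small block size.
    Phase 2 (the remainder): sample uniformly from the mixed set.
    Both the phase boundary and the uniform sampling are assumptions.
    """
    if not 0 <= step < total_steps:
        raise ValueError("step must be in [0, total_steps)")
    if step < int(warmup_frac * total_steps):
        return small
    return rng.choice(mixed)


def split_into_blocks(tokens, block_size):
    """Partition a token sequence into consecutive blocks of `block_size`.

    The last block may be shorter if the length is not a multiple.
    """
    return [tokens[i:i + block_size] for i in range(0, len(tokens), block_size)]


if __name__ == "__main__":
    total = 10
    for step in range(total):
        bs = block_size_for_step(step, total)
        print(step, bs, split_into_blocks(list(range(12)), bs))
```

A real training loop would use the returned block size to build the block-wise attention mask and noise pattern for that step. Those details are outside what the summary specifies.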
In practice
- Implement block-size curriculum for diffusion model training.
- Apply RelaxedConfidence decoding for throughput gains.
- Evaluate diffusion models across diverse inference block sizes.
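To make the throughput and evaluation items above concrete, the toy simulation below measures TPF (tokens per forward pass) for a confidence-thresholded decoder at several inference block sizes. It is a sketch under strong assumptions and is not RelaxedConfidence itself: per-token confidences are fixed inputs (a real model recomputes them on every pass), the commit rule is a plain fixed threshold, and all function names are hypothetical. It only illustrates why lowering a commitment threshold raises TPF.

```python
def decode_block(confidences, threshold):
    """Simulate parallel decoding of one block; return forward passes used.

    Each pass commits every remaining position whose confidence meets the
    threshold. If none qualifies, the single most confident position is
    committed so that decoding always makes progress.
    """
    remaining = list(confidences)
    passes = 0
    while remaining:
        passes += 1
        keep = [c for c in remaining if c < threshold]
        if len(keep) == len(remaining):
            # Nothing met the threshold: commit only the most confident token.
            keep.remove(max(keep))
        remaining = keep
    return passes


def tokens_per_forward(blocks, threshold):
    """TPF = total tokens decoded / total forward passes."""
    total_tokens = sum(len(b) for b in blocks)
    total_passes = sum(decode_block(b, threshold) for b in blocks)
    return total_tokens / total_passes if total_passes else 0.0


def tpf_by_block_size(confidences, block_sizes, threshold):
    """Chunk one confidence sequence at each block size and report TPF."""
    results = {}
    for bs in block_sizes:
        blocks = [confidences[i:i + bs] for i in range(0, len(confidences), bs)]
        results[bs] = tokens_per_forward(blocks, threshold)
    return results


if __name__ == "__main__":
    confs = [0.95, 0.9, 0.6, 0.5, 0.99, 0.7, 0.92, 0.4]
    for thr in (0.9, 0.6):  # strict vs. relaxed threshold (illustrative values)
        print(thr, tpf_by_block_size(confs, (4, 8), thr))
```

Comparing the strict and relaxed runs at each block size gives a relative TPF gain. In a real evaluation, that gain would be reported alongside benchmark accuracy to check that reasoning quality is preserved.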
Topics
- Diffusion Language Models
- Block Diffusion
- Chain-of-Thought Reasoning
- Curriculum Learning
- Model Efficiency
- Mathematical Reasoning
- Code Generation
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.