Sumi: Open Uniform Diffusion Language Model from Scratch
Summary
Sumi is an open 7B uniform diffusion language model (UDLM) pretrained from scratch on 1.5T tokens, presented as the first UDLM trained from scratch at large scale in both parameters and tokens. Built on the generalized interpolating discrete diffusion (GIDD) framework and a LLaMA-style Transformer architecture, Sumi was trained on 288 NVIDIA H100 GPUs for 43,308 GPU-hours. It performs competitively with autoregressive models such as Llama 2-7B on general knowledge, reasoning, and coding benchmarks. It trails noticeably on commonsense tasks, a gap attributed to its education- and code-heavy data mixture. The model weights, checkpoints, and full training recipe, including a complete specification of the publicly available data mixture, are openly released to foster community research.
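As a rough sanity check on the compute figures above, the sketch below turns them into a wall-clock estimate and a per-GPU throughput. It assumes that all 288 GPUs ran concurrently for the whole budget and that the 43,308 GPU-hours cover exactly the 1.5T-token run. Neither assumption is confirmed by the summary, and the variable names are illustrative.

```python
# Figures quoted in the summary above.
NUM_GPUS = 288
GPU_HOURS = 43_308
TOKENS = 1.5e12

# Assumption: every GPU was busy for the whole run, so wall-clock time
# is simply the GPU-hour budget divided by the GPU count.
wall_clock_hours = GPU_HOURS / NUM_GPUS
wall_clock_days = wall_clock_hours / 24

# Assumption: the GPU-hour figure covers exactly the 1.5T-token run.
tokens_per_gpu_second = TOKENS / (GPU_HOURS * 3600)

print(f"wall clock: {wall_clock_hours:.1f} h ({wall_clock_days:.1f} days)")
print(f"throughput: {tokens_per_gpu_second:,.0f} tokens/s per GPU")
```

Under these assumptions the run comes to roughly 150 hours of wall-clock time (a little over six days) and on the order of 9,600 tokens per second per GPU.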
Key takeaway
For AI scientists and ML engineers exploring alternatives to autoregressive language models, Sumi provides an open baseline for uniform diffusion models. Investigate its generation dynamics and controllability, keeping in mind that it is competitive on knowledge and coding tasks but weaker on commonsense tasks. Use the released training recipe to replicate or extend this work, for example by adjusting the data mixture or adding targeted revision mechanisms to close the commonsense gap.
Key insights
Sumi is the first large-scale, scratch-pretrained uniform diffusion language model, offering a new research baseline.
Principles
- Uniform diffusion permits any token to be updated at any step.
- Education- and code-heavy data mixtures help knowledge and coding benchmarks but are associated with lower commonsense scores.
- Confidence-based sampling induces a self-organized token commitment order.
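The first principle follows from how uniform noise corrupts text: a noised position holds an ordinary vocabulary token rather than a special mask, so the model cannot tell from the input alone which positions are already correct. A minimal sketch of that corruption step follows. It assumes a keep probability `alpha` supplied by some noise schedule, and the function name and arguments are hypothetical rather than taken from the Sumi release.

```python
import random

def corrupt_uniform(tokens, alpha, vocab_size, rng):
    """Keep each token with probability alpha; otherwise replace it
    with a token drawn uniformly from the vocabulary.

    Illustrative only: alpha would come from a noise schedule that
    the summary does not specify.
    """
    noisy = []
    for tok in tokens:
        if rng.random() < alpha:
            noisy.append(tok)                        # token survives
        else:
            noisy.append(rng.randrange(vocab_size))  # uniform replacement
    return noisy

rng = random.Random(0)
clean = [5, 17, 42, 8, 99, 3]
print(corrupt_uniform(clean, alpha=0.5, vocab_size=128, rng=rng))
```

Because a replacement can coincide with the original token, the fraction of unchanged positions is slightly higher than `alpha`.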
Method
Sumi is a 7B-parameter, time-agnostic, bidirectional Transformer trained with the GIDD objective (in its SNR-reparameterized form) under pure uniform noise on 1.5T tokens, using a three-stage warmup-stable-decay (WSD) learning-rate schedule.
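To make the schedule concrete, here is a minimal sketch of a WSD learning-rate function, assuming the three stages are the warmup, stable, and decay phases that the name refers to. The phase lengths, peak learning rate, and linear decay shape are placeholder assumptions, not values from the Sumi recipe.

```python
def wsd_lr(step, total_steps, peak_lr, warmup_steps, decay_steps, min_lr=0.0):
    """Warmup-stable-decay schedule (illustrative sketch).

    Stage 1: linear warmup to peak_lr over warmup_steps.
    Stage 2: hold peak_lr.
    Stage 3: linear decay to min_lr over the final decay_steps.
    """
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        return peak_lr
    progress = min(1.0, (step - decay_start) / decay_steps)
    return peak_lr + (min_lr - peak_lr) * progress

# Hypothetical settings, chosen only to show the three stages.
for s in (0, 9, 500, 950, 1000):
    print(s, wsd_lr(s, total_steps=1000, peak_lr=3e-4,
                    warmup_steps=10, decay_steps=100))
```

A practical appeal of this shape is that the long constant stage lets intermediate checkpoints be decayed separately, which fits a release that includes intermediate checkpoints.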
In practice
- Evaluate UDLMs on knowledge and coding benchmarks.
- Consider data mixture impact on commonsense reasoning.
- Explore confidence sampling for structured generation.
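The idea behind confidence-based sampling can be sketched with a toy decoding loop: start from pure uniform noise, ask the model for a distribution at every position, and commit the positions it is most sure about first. Everything below is an assumption for illustration. `make_toy_denoiser` is a hypothetical stand-in for a trained model, and the loop freezes committed tokens for simplicity, whereas a uniform diffusion model could in principle revise them later.

```python
import random

def make_toy_denoiser(target, vocab_size, sharpness):
    """Hypothetical stand-in for a trained model: puts probability
    sharpness[pos] on target[pos] and spreads the rest uniformly."""
    def denoiser(tokens):
        rows = []
        for pos, tgt in enumerate(target):
            rest = (1.0 - sharpness[pos]) / (vocab_size - 1)
            row = [rest] * vocab_size
            row[tgt] = sharpness[pos]
            rows.append(row)
        return rows
    return denoiser

def confidence_decode(denoiser, length, vocab_size, num_steps, rng):
    """Commit positions in order of model confidence, not left to right."""
    tokens = [rng.randrange(vocab_size) for _ in range(length)]  # uniform noise
    committed = [False] * length
    order = []                          # positions in commitment order
    per_step = -(-length // num_steps)  # ceiling division
    for _ in range(num_steps):
        probs = denoiser(tokens)
        candidates = []
        for pos in range(length):
            if not committed[pos]:
                best = max(range(vocab_size), key=lambda v: probs[pos][v])
                candidates.append((probs[pos][best], pos, best))
        if not candidates:
            break
        candidates.sort(reverse=True)   # most confident first
        for _conf, pos, best in candidates[:per_step]:
            tokens[pos] = best
            committed[pos] = True
            order.append(pos)
    return tokens, order

toy = make_toy_denoiser([3, 1, 4, 1], vocab_size=5,
                        sharpness=[0.6, 0.9, 0.7, 0.8])
print(confidence_decode(toy, length=4, vocab_size=5,
                        num_steps=4, rng=random.Random(0)))
```

In this toy run the commitment order follows the confidence values rather than position. That is the kind of self-organized ordering worth inspecting when applying such a sampler to structured outputs like code.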
Topics
- Uniform Diffusion Language Models
- Sumi
- Diffusion Models
- Large Language Models
- Model Pretraining
- Training Data Mixtures
- Generative AI
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.