Sumi: Open Uniform Diffusion Language Model from Scratch
Summary
Sumi is a fully open 7B-parameter uniform diffusion language model (UDLM) pretrained from scratch on 1.5 trillion tokens. It fills a gap in the field: unlike autoregressive and masked diffusion models, no UDLM had previously been pretrained at both large parameter scale and large token budget. Sumi performs competitively with autoregressive models on knowledge, reasoning, and coding benchmarks but underperforms on commonsense tasks, possibly because of its education-heavy data mixture. The project aims to provide a clean reference point for studying UDLM scaling behavior, generation dynamics, controllability, and trade-offs; the model weights, checkpoints, and full training recipe are publicly available.
Key takeaway
For AI scientists and machine learning engineers exploring alternative language model architectures, Sumi provides an open baseline for uniform diffusion. With the released weights and training recipe, you can study its scaling behavior, generation dynamics, and controllability directly and compare them against established autoregressive models. This supports further research into the properties of uniform diffusion and can inform future architectural choices.
Key insights
Sumi is the first large-scale, scratch-pretrained uniform diffusion language model, offering a new open research baseline.
Principles
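- Uniform diffusion models corrupt text by replacing tokens with random vocabulary tokens rather than a mask symbol, so any position can be revised at any denoising step.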
- Large-scale scratch pretraining provides clean research baselines.
- Data mixture significantly influences model performance.
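The first principle above can be illustrated with a toy forward (noising) step. This is a minimal sketch, not Sumi's training code: the function names, the `mask_id` value, the tiny vocabulary, and the single `keep_prob` standing in for a full noise schedule are all assumptions made for illustration. It contrasts uniform corruption, where a noised position holds an ordinary vocabulary token, with masked corruption, where a noised position holds a dedicated mask symbol.

```python
import random

def uniform_corrupt(tokens, vocab_size, keep_prob, rng):
    """Keep each token with probability keep_prob; otherwise resample it
    uniformly from the whole vocabulary (it may land on the original token)."""
    return [
        tok if rng.random() < keep_prob else rng.randrange(vocab_size)
        for tok in tokens
    ]

def masked_corrupt(tokens, mask_id, keep_prob, rng):
    """Keep each token with probability keep_prob; otherwise replace it
    with a dedicated mask symbol."""
    return [tok if rng.random() < keep_prob else mask_id for tok in tokens]

# Hypothetical toy setup: vocabulary ids 0..99, with id 100 reserved as the mask.
rng = random.Random(0)
tokens = [5, 17, 42, 8, 99, 3]
noisy_uniform = uniform_corrupt(tokens, vocab_size=100, keep_prob=0.5, rng=rng)
noisy_masked = masked_corrupt(tokens, mask_id=100, keep_prob=0.5, rng=rng)

print("uniform:", noisy_uniform)
print("masked: ", noisy_masked)
```

In the masked case a denoiser can see which positions were corrupted. In the uniform case every position looks like a normal token, so the denoiser has to consider revising any of them, which is where the flexibility in token updates comes from.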
Method
Sumi is a 7B uniform diffusion language model pretrained from scratch on 1.5T tokens, using a specified data mixture drawn from publicly available corpora.
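To put the two headline figures in proportion, the short calculation below derives the tokens-per-parameter ratio from the stated 7B and 1.5T. It also prints a rough training-compute estimate using the common 6 x parameters x tokens rule of thumb for dense transformers. That rule is an assumption applied here for illustration, not a number reported for Sumi, and "7B" is treated as exactly 7e9 parameters.

```python
# Figures stated in the summary (nominal values).
params = 7e9      # 7B parameters
tokens = 1.5e12   # 1.5T training tokens

# Training tokens seen per model parameter.
tokens_per_param = tokens / params

# Rough compute estimate: about 6 FLOPs per parameter per token
# (forward plus backward pass). Rule-of-thumb assumption, not a reported figure.
approx_train_flops = 6 * params * tokens

print(f"tokens per parameter: {tokens_per_param:.0f}")
print(f"approx. training FLOPs: {approx_train_flops:.1e}")
```

Under these assumptions the ratio comes out to roughly 214 tokens per parameter.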
In practice
- Study native uniform diffusion at scale.
- Analyze generation dynamics and controllability.
- Compare UDLM trade-offs against autoregressive models.
Topics
- Uniform Diffusion Models
- Language Models
- Model Pretraining
- Open-Source AI
- Scaling Laws
- Benchmarking
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.