Sumi: Open Uniform Diffusion Language Model from Scratch

· Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Expert, quick

Summary

Sumi is a fully open 7B uniform diffusion language model (UDLM) pretrained from scratch on 1.5 trillion tokens. It fills a gap: unlike autoregressive and masked diffusion models, no UDLM had previously been pretrained at both large parameter scale and large token budget. Sumi performs competitively with autoregressive models on knowledge, reasoning, and coding benchmarks but underperforms on commonsense tasks, possibly because of its education-heavy data mixture. The project aims to provide a clean reference point for studying UDLM scaling behavior, generation dynamics, controllability, and trade-offs; the model weights, checkpoints, and full training recipe are publicly available.

Key takeaway

For AI scientists and machine learning engineers exploring alternative language model architectures, Sumi provides an open baseline for uniform diffusion. With the released weights, checkpoints, and training recipe, you can study its scaling behavior, generation dynamics, and controllability directly and compare them against established autoregressive models. That supports research into the properties of uniform diffusion and can inform future model and architecture choices.

Key insights

Sumi is the first large-scale, scratch-pretrained uniform diffusion language model, offering a new open research baseline.

Method

Sumi, a 7B uniform diffusion language model, was pretrained from scratch on 1.5T tokens using a specified data mixture drawn from publicly available corpora.
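For readers less familiar with the term, the following is a minimal sketch of the forward (noising) process commonly used in uniform-state discrete diffusion: each token is kept with probability alpha_t and otherwise replaced by a token drawn uniformly from the vocabulary. It is an illustration only, not Sumi's training code. The summary does not give the noise schedule, vocabulary size, or any implementation details, so the function names, the toy vocabulary of 100 tokens, and the alpha values below are all assumptions.

```python
import random


def uniform_corrupt(tokens, alpha_t, vocab_size, rng):
    """Noise a token sequence under a uniform-state diffusion kernel.

    Each position keeps its token with probability alpha_t; otherwise it is
    replaced by a token sampled uniformly from the whole vocabulary (the
    sample may coincide with the original token). Hypothetical helper, not
    taken from the Sumi release.
    """
    if not 0.0 <= alpha_t <= 1.0:
        raise ValueError("alpha_t must be in [0, 1]")
    noisy = []
    for tok in tokens:
        if rng.random() < alpha_t:
            noisy.append(tok)  # keep the clean token
        else:
            noisy.append(rng.randrange(vocab_size))  # uniform replacement
    return noisy


def marginal_keep_probability(alpha_t, vocab_size):
    """Probability that a position still shows its original token."""
    return alpha_t + (1.0 - alpha_t) / vocab_size


if __name__ == "__main__":
    rng = random.Random(0)
    clean = [5, 17, 42, 8, 99, 3]  # toy token ids, assumed vocab of 100
    for alpha in (1.0, 0.5, 0.0):
        print(alpha, uniform_corrupt(clean, alpha, vocab_size=100, rng=rng))
```

Unlike masked diffusion, no special mask token marks which positions were corrupted, so a denoiser trained on such inputs has to decide for itself which tokens to revise. That is one reason generation dynamics and controllability are natural things to study with an open model of this kind.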

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.