Sumi: Open Uniform Diffusion Language Model from Scratch

2026-06-16 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Data Science & Analytics · Depth: Expert, extended

Summary

Sumi is an open 7B-parameter uniform diffusion language model (UDLM) pretrained from scratch on 1.5T tokens, presented as the first UDLM trained natively at both large parameter scale and large token scale. Built on the generalized interpolating discrete diffusion (GIDD) framework and a LLaMA-style Transformer architecture, Sumi was trained on 288 NVIDIA H100 GPUs for 43,308 GPU-hours. It performs competitively with autoregressive models such as Llama 2-7B on general knowledge, reasoning, and coding benchmarks, but shows a noticeable gap on commonsense tasks, which is attributed to its education- and code-heavy data mixture. The model weights, checkpoints, and full training recipe, including a complete specification of the publicly available data mixture, are openly released to foster community research.
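The compute figures in the summary can be turned into rough throughput numbers. The sketch below is back-of-the-envelope arithmetic only. It assumes that all 288 GPUs ran for the whole job, that the 43,308 GPU-hours cover exactly one pass over the 1.5T tokens, and that the common 6 x parameters x tokens rule of thumb for dense Transformer training FLOPs applies. None of these assumptions is stated in the source. Under them, the wall-clock time comes to about 150 hours, or roughly 6.3 days.

```python
# Back-of-the-envelope compute arithmetic from the figures quoted in the summary.
# Assumptions (not from the source): all GPUs busy for the full run, GPU-hours
# cover a single pass over the data, and the 6 * N * D FLOPs rule of thumb holds.

gpus = 288            # NVIDIA H100 GPUs (from the summary)
gpu_hours = 43_308    # total GPU-hours (from the summary)
tokens = 1.5e12       # training tokens (from the summary)
params = 7e9          # model parameters (from the summary)

gpu_seconds = gpu_hours * 3600

# Wall-clock duration if every GPU ran for the whole job.
wall_clock_hours = gpu_hours / gpus
wall_clock_days = wall_clock_hours / 24

# Average training throughput per GPU.
tokens_per_gpu_second = tokens / gpu_seconds

# Approximate total training FLOPs and the implied per-GPU rate.
train_flops = 6 * params * tokens
flops_per_gpu_second = train_flops / gpu_seconds

print(f"wall clock: {wall_clock_hours:.1f} h ({wall_clock_days:.2f} days)")
print(f"throughput: {tokens_per_gpu_second:,.0f} tokens per GPU-second")
print(f"compute:    {train_flops:.2e} FLOPs total, "
      f"{flops_per_gpu_second:.2e} FLOPs per GPU-second")
```

These numbers are only as good as the assumptions above; they are useful for sanity-checking a replication budget, not as reported results.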

Key takeaway

For AI Scientists and ML Engineers exploring alternative language model architectures, Sumi provides a crucial open-source baseline for uniform diffusion models. You should investigate its generation dynamics and controllability, especially considering its competitive performance on knowledge and coding tasks versus its commonsense limitations. Use the released training recipe to replicate or extend this work, focusing on data mixture adjustments or targeted revision mechanisms to address current gaps.

Key insights

Sumi is the first large-scale, scratch-pretrained uniform diffusion language model, offering a new research baseline.

Principles

Uniform diffusion permits any token to be updated at any step.
Education- and code-heavy data mixtures favor knowledge and coding benchmarks; the gap on commonsense tasks is attributed to this mixture.
Confidence-based sampling induces a self-organized token commitment order.
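The first principle follows from how uniform noise corrupts a sequence: there is no special mask token, so a noisy position holds an ordinary vocabulary token and the model cannot tell from the input alone which positions are clean. A minimal sketch of that corruption step is shown here, assuming the usual uniform-noise marginal in which each token is kept with probability alpha and otherwise replaced by a uniform draw from the vocabulary. The function name `uniform_corrupt` and its parameters are illustrative, not taken from the Sumi release.

```python
import random


def uniform_corrupt(tokens, alpha, vocab_size, rng):
    """Corrupt a token sequence with uniform noise.

    Each position keeps its token with probability `alpha`; otherwise it is
    replaced by a token drawn uniformly from the vocabulary (which may, by
    chance, equal the original token). Hypothetical helper for illustration.
    """
    noisy = []
    for tok in tokens:
        if rng.random() < alpha:
            noisy.append(tok)                        # keep the clean token
        else:
            noisy.append(rng.randrange(vocab_size))  # uniform replacement
    return noisy


rng = random.Random(0)
clean = [5, 6, 7, 8]
# alpha near 1 is almost clean; alpha near 0 is almost pure uniform noise.
print(uniform_corrupt(clean, alpha=0.5, vocab_size=10, rng=rng))
```

Because every position always holds a valid token, a denoiser trained on such inputs can in principle revise any position at any step, which is the property the first principle describes.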

Method

Sumi is a 7B-parameter time-agnostic bidirectional Transformer trained with the GIDD objective (SNR-reparameterized form) under pure uniform noise on 1.5T tokens, using a three-stage WSD learning-rate schedule.
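A WSD schedule is commonly read as warmup, stable, decay. The sketch below illustrates that shape under assumptions: linear warmup, a constant plateau, and a linear decay to a floor. The step counts, peak rate, and decay shape are placeholders, since the summary does not give Sumi's actual values or say how its three stages are defined.

```python
def wsd_lr(step, peak_lr, warmup_steps, stable_steps, decay_steps, min_lr=0.0):
    """Learning rate at `step` for a warmup-stable-decay schedule.

    Hypothetical helper: linear warmup to `peak_lr`, constant plateau,
    then linear decay to `min_lr`. All arguments are illustrative.
    """
    if step < warmup_steps:
        # Stage 1: linear warmup, reaching peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    if step < warmup_steps + stable_steps:
        # Stage 2: hold the peak learning rate.
        return peak_lr
    # Stage 3: linear decay, clamped at min_lr once decay_steps have elapsed.
    elapsed = min(step - warmup_steps - stable_steps, decay_steps)
    return peak_lr + (min_lr - peak_lr) * (elapsed / decay_steps)


# Placeholder values, not Sumi's configuration.
for s in (0, 9, 50, 115, 120):
    print(s, wsd_lr(s, peak_lr=1.0, warmup_steps=10, stable_steps=100, decay_steps=10))
```

One practical appeal of this shape is that the stable stage can be extended or branched from an intermediate checkpoint before a decay is applied, which fits a release that includes checkpoints.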

In practice

Evaluate UDLMs on knowledge and coding benchmarks.
Account for the data mixture when interpreting commonsense-reasoning results.
Explore confidence sampling for structured generation.
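As a starting point for experimenting with confidence sampling, the sketch below shows one simple variant: at each step, the uncommitted positions where the model is most confident are fixed to their most likely token, and the order in which positions get fixed is recorded. This is an illustrative simplification, not Sumi's sampler. The `predict` callable stands in for a denoiser, and the function name, the even per-step budget, and the rule that committed positions are never revisited are all assumptions.

```python
def confidence_decode(predict, tokens, num_steps):
    """Greedy confidence-ordered decoding (illustrative sketch).

    `predict(tokens)` must return, for every position, a list of
    probabilities over the vocabulary. Returns the decoded tokens and the
    order in which positions were committed.
    """
    tokens = list(tokens)
    n = len(tokens)
    committed = [False] * n
    order = []
    per_step = -(-n // num_steps)  # ceil(n / num_steps) positions per step

    for _ in range(num_steps):
        probs = predict(tokens)
        candidates = []
        for i in range(n):
            if committed[i]:
                continue
            best = max(range(len(probs[i])), key=probs[i].__getitem__)
            candidates.append((probs[i][best], i, best))
        if not candidates:
            break
        # Highest confidence first; ties broken by position index.
        candidates.sort(key=lambda c: (-c[0], c[1]))
        for _conf, i, best in candidates[:per_step]:
            tokens[i] = best
            committed[i] = True
            order.append(i)
    return tokens, order


def toy_predict(tokens):
    # Stand-in for a real denoiser: fixed distributions over a 3-token vocabulary.
    return [[0.1, 0.9, 0.0], [0.4, 0.3, 0.3], [0.0, 0.2, 0.8]]


decoded, commit_order = confidence_decode(toy_predict, [2, 2, 0], num_steps=3)
print(decoded, commit_order)
```

Logging `commit_order` over many prompts is one way to inspect whether a model settles structural tokens (for example brackets or keywords in code) before content tokens.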

Topics

Uniform Diffusion Language Models
Sumi
Diffusion Models
Large Language Models
Model Pretraining
Training Data Mixtures
Generative AI

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.