Sumi: Open Uniform Diffusion Language Model from Scratch

Source: cs.CL updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Software Development & Engineering, Data Science & Analytics) · Depth: Expert, extended

Summary

Sumi is an open 7B-parameter uniform diffusion language model (UDLM) pretrained from scratch on 1.5T tokens, making it the first UDLM trained natively at large scale in both parameters and tokens. It is built on the generalized interpolating discrete diffusion (GIDD) framework with a LLaMA-style Transformer architecture, and was trained on 288 NVIDIA H100 GPUs for 43,308 GPU-hours. Sumi is competitive with autoregressive models such as Llama 2-7B on general-knowledge, reasoning, and coding benchmarks, but it trails on commonsense tasks, a gap attributed to its education- and code-heavy data mixture. The model weights, checkpoints, and full training recipe, including a complete specification of the publicly available data mixture, are openly released to support community research.
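The compute figures in the summary can be turned into rough wall-clock and throughput estimates. The short calculation below is only a back-of-the-envelope sketch: it assumes all 288 GPUs ran concurrently for the whole run and that the 43,308 GPU-hours cover the full 1.5T-token pretraining. The summary states neither, and the variable names are illustrative.

```python
# Figures quoted in the summary.
GPU_HOURS = 43_308
NUM_GPUS = 288
TOKENS = 1.5e12

# Assumption: all 288 GPUs were busy for the entire run.
wall_clock_hours = GPU_HOURS / NUM_GPUS
wall_clock_days = wall_clock_hours / 24

# Assumption: the GPU-hour total covers all 1.5T training tokens.
tokens_per_gpu_hour = TOKENS / GPU_HOURS
tokens_per_gpu_second = tokens_per_gpu_hour / 3600

print(f"wall clock: {wall_clock_hours:.3f} h (about {wall_clock_days:.1f} days)")
print(f"average throughput: {tokens_per_gpu_second:,.0f} tokens per GPU-second")
```

Under these assumptions the run corresponds to a little over six days of wall-clock time, at an average on the order of ten thousand tokens per GPU-second.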

Key takeaway

For AI scientists and ML engineers exploring alternatives to autoregressive language models, Sumi provides an open baseline for uniform diffusion models. Investigate its generation dynamics and controllability, weighing its competitive performance on knowledge and coding tasks against its weaker commonsense results. Use the released training recipe to replicate or extend the work, for example by adjusting the data mixture or adding targeted revision mechanisms to close the current gaps.

Key insights

Sumi is the first large-scale, scratch-pretrained uniform diffusion language model, offering a new research baseline.

Method

Sumi is a 7B-parameter, time-agnostic, bidirectional Transformer trained on 1.5T tokens with the GIDD objective (in its SNR-reparameterized form) under pure uniform noise, using a three-stage warmup-stable-decay (WSD) learning-rate schedule.
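Two ingredients of the method can be illustrated with a small sketch. The first function samples from a GIDD-style marginal with a uniform mixing distribution: each token is kept with probability alpha_t and otherwise redrawn uniformly from the vocabulary. The second is a generic WSD schedule. Everything here is an assumption made for illustration and not Sumi's released recipe: the linear alpha_t, the reading of "three-stage" as warmup, stable, and decay, the linear decay shape, and all names and default values.

```python
import random


def alpha(t: float) -> float:
    """Signal level at time t in [0, 1]; a linear schedule, chosen only for illustration."""
    return 1.0 - t


def corrupt_uniform(tokens, t, vocab_size, rng):
    """Sample z_t ~ Cat(alpha_t * onehot(x) + (1 - alpha_t) * Uniform(vocab)) per position."""
    a = alpha(t)
    noisy = []
    for tok in tokens:
        if rng.random() < a:
            noisy.append(tok)  # keep the clean token
        else:
            # Redraw uniformly over the vocabulary; this may return the same id again.
            noisy.append(rng.randrange(vocab_size))
    return noisy


def wsd_lr(step, total_steps, peak_lr, warmup_steps, decay_steps, min_lr=0.0):
    """Warmup-stable-decay: linear warmup, constant plateau, linear decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    decay_start = total_steps - decay_steps
    if step < decay_start:
        return peak_lr
    frac = min((step - decay_start) / decay_steps, 1.0)
    return peak_lr + (min_lr - peak_lr) * frac


# Hypothetical usage with toy values.
rng = random.Random(0)
x = [5, 17, 42, 8, 99, 3]
z_half = corrupt_uniform(x, t=0.5, vocab_size=128, rng=rng)
lr_mid = wsd_lr(step=500, total_steps=1000, peak_lr=3e-4,
                warmup_steps=100, decay_steps=200)
```

Unlike masked diffusion, the corrupted sequence contains no special mask token: a noisy position looks like any other token, so the denoiser has to decide for itself which positions to revise.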

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.