Sakana AI Proposes DiffusionBlocks: a Block-wise Training Framework That Converts Residual Networks into Independently Trainable Denoising Modules

2026-05-28 · Source: Machine Learning ML & Generative AI News · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

Sakana AI has introduced DiffusionBlocks, a block-wise training framework that turns residual networks into independently trainable denoising modules. It partitions a transformer-based network into B blocks that are trained independently, reducing training memory by a factor of B. The core idea is to read the residual connections in a transformer as Euler discretization steps of the probability flow ODE used in score-based diffusion models, so that each block can be trained on its own via score matching. Converting a residual network requires three modifications: partitioning its L layers into B blocks, assigning each block a noise range via equi-probability partitioning, and adding noise-level conditioning via AdaLN. The framework was validated across five architectures, including ViT on CIFAR-100 (59.30% vs 60.25% baseline) and DiT-L/2 on ImageNet 256 (FID 10.63 vs 12.09 baseline with 3x less memory). On CIFAR-10, switching from uniform to equi-probability partitioning improved FID from 43.53 to 38.03. Recurrent-depth models such as Huginn saw the largest gains, with total training compute dropping by approximately 10x for 32-iteration BPTT.

Key takeaway

For machine learning engineers training large transformer models, Sakana AI's DiffusionBlocks offers a way to reduce training memory and compute. If you are constrained by GPU memory, or by long training times for recurrent-depth architectures, consider this framework. It trains network blocks independently, with a reported reduction in total training compute of approximately 10x for recurrent-depth models like Huginn (32-iteration BPTT), without altering K-iteration inference. This could shorten development cycles and allow larger models to be explored on existing hardware.

Key insights

Residual connections in transformers can be reframed as Euler discretization steps of the probability flow ODE in score-based diffusion models, which enables block-wise independent training via score matching.
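
The correspondence can be written out in a few lines. The sketch below assumes the common parameterization in which the noise level sigma itself plays the role of time and the score is expressed through a denoiser D_theta; the summary does not state which parameterization the authors use, so the symbols f_l, D_theta, and sigma_i are illustrative.

```latex
% Residual layer l of a network with hidden state x_l
x_{l+1} = x_l + f_l(x_l)

% Probability flow ODE, assuming time is the noise level \sigma,
% with the score written via a denoiser:
% \nabla_x \log p_\sigma(x) \approx \frac{D_\theta(x;\sigma) - x}{\sigma^2}
\frac{dx}{d\sigma} = -\sigma \, \nabla_x \log p_\sigma(x)
                   = \frac{x - D_\theta(x;\sigma)}{\sigma}

% One Euler step from noise level \sigma_i to \sigma_{i+1}
x_{i+1} = x_i + \frac{\sigma_{i+1} - \sigma_i}{\sigma_i}
          \bigl( x_i - D_\theta(x_i;\sigma_i) \bigr)

% Matching terms: the residual branch plays the role of the Euler update
f_l(x_l) \;\longleftrightarrow\;
\frac{\sigma_{i+1} - \sigma_i}{\sigma_i} \bigl( x_i - D_\theta(x_i;\sigma_i) \bigr)
```

Read this way, each group of layers only has to denoise well within its own range of noise levels, which is what makes training the groups separately plausible.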

Principles

Residual connections map to probability flow ODEs.
Equi-probability partitioning improves noise assignment.
Independent block training reduces memory and compute.
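
To make the equi-probability principle concrete, here is a minimal sketch of how block boundaries might be computed. It assumes noise levels are drawn from a log-normal training distribution truncated to [sigma_min, sigma_max]; the distribution family, the parameter values, and the function name equi_probability_boundaries are illustrative assumptions, not details taken from the source.

```python
import math
from statistics import NormalDist


def equi_probability_boundaries(num_blocks, p_mean=-1.2, p_std=1.2,
                                sigma_min=0.002, sigma_max=80.0):
    """Return num_blocks + 1 noise-level boundaries of equal probability mass.

    Assumption: ln(sigma) ~ Normal(p_mean, p_std), truncated to
    [sigma_min, sigma_max]. All default values are illustrative.
    """
    dist = NormalDist(mu=p_mean, sigma=p_std)
    lo = dist.cdf(math.log(sigma_min))  # CDF value at the lower cutoff
    hi = dist.cdf(math.log(sigma_max))  # CDF value at the upper cutoff

    bounds = [sigma_min]
    for i in range(1, num_blocks):
        # i-th quantile of the truncated distribution
        q = lo + (hi - lo) * i / num_blocks
        bounds.append(math.exp(dist.inv_cdf(q)))
    bounds.append(sigma_max)
    return bounds


def uniform_boundaries(num_blocks, sigma_min=0.002, sigma_max=80.0):
    """Baseline for comparison: equal-width intervals in sigma."""
    step = (sigma_max - sigma_min) / num_blocks
    return [sigma_min + i * step for i in range(num_blocks + 1)]


if __name__ == "__main__":
    print("equi-probability:", equi_probability_boundaries(4))
    print("uniform:         ", uniform_boundaries(4))
```

Under this assumption, each block sees the same share of sampled noise levels during training, whereas equal-width intervals would leave most of the probability mass in the first block.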

Method

Partition L layers into B blocks, assign noise ranges via equi-probability partitioning, and add AdaLN for noise-level conditioning.
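
The layer partitioning and the AdaLN conditioning can be sketched in plain Python. This is a toy illustration under stated assumptions: contiguous, near-equal block sizes are assumed, the conditioning uses a single scalar scale and shift where real implementations typically use learned per-channel vectors, and toy_conditioning is a hypothetical stand-in for a learned network.

```python
import math


def partition_layers(num_layers, num_blocks):
    """Split layer indices 0..L-1 into B contiguous blocks of near-equal size."""
    base, extra = divmod(num_layers, num_blocks)
    blocks, start = [], 0
    for b in range(num_blocks):
        size = base + (1 if b < extra else 0)  # spread the remainder over early blocks
        blocks.append(list(range(start, start + size)))
        start += size
    return blocks


def layer_norm(x, eps=1e-5):
    """Normalize a feature vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def adaln(x, scale, shift):
    """Adaptive LayerNorm: normalize, then apply a conditioning-dependent
    scale and shift. Scalars are used here for brevity."""
    return [(1.0 + scale) * v + shift for v in layer_norm(x)]


def toy_conditioning(sigma):
    """Hypothetical stand-in for the learned map from noise level to (scale, shift)."""
    c = math.log(sigma)
    return 0.1 * c, 0.05 * c


if __name__ == "__main__":
    print(partition_layers(num_layers=12, num_blocks=3))
    scale, shift = toy_conditioning(sigma=0.5)
    print(adaln([1.0, 2.0, 3.0, 4.0], scale, shift))
```

The point of the conditioning is that a block can adapt its normalization to the noise level it is currently denoising, without the rest of the layer changing.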

In practice

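Apply DiffusionBlocks to ViT, DiT, masked diffusion, and autoregressive (AR) Transformer models.
Train recurrent-depth models like Huginn block-wise for an approximately 10x reduction in total training compute (reported for 32-iteration BPTT).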
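
For any of these architectures, a single independent training step might look roughly like the sketch below: pick one block, sample a noise level inside that block's range, and evaluate a denoising loss for that block only. The log-uniform sampling inside a range, the squared-error denoising objective, and all names (train_step, toy_denoiser, block_for_sigma) are assumptions made for illustration; no gradients are computed here.

```python
import bisect
import math
import random


def block_for_sigma(sigma, boundaries):
    """Index of the block whose noise range contains sigma (used at inference)."""
    idx = bisect.bisect_right(boundaries, sigma) - 1
    return min(max(idx, 0), len(boundaries) - 2)


def toy_denoiser(noisy, sigma):
    """Placeholder denoiser: the posterior mean for unit-variance Gaussian data."""
    return [v / (1.0 + sigma ** 2) for v in noisy]


def denoising_loss(denoiser, clean, sigma, rng):
    """Mean squared error between the denoised sample and the clean target."""
    noisy = [v + sigma * rng.gauss(0.0, 1.0) for v in clean]
    pred = denoiser(noisy, sigma)
    return sum((p - c) ** 2 for p, c in zip(pred, clean)) / len(clean)


def train_step(block_denoisers, boundaries, clean, rng):
    """Evaluate the loss for one randomly chosen block only.

    In a real framework, the backward pass and optimizer step would touch
    just this block's parameters, so only its activations need to be stored.
    """
    b = rng.randrange(len(block_denoisers))
    lo, hi = boundaries[b], boundaries[b + 1]
    # Assumption: log-uniform sampling inside the block's noise range
    sigma = math.exp(rng.uniform(math.log(lo), math.log(hi)))
    loss = denoising_loss(block_denoisers[b], clean, sigma, rng)
    return b, sigma, loss


if __name__ == "__main__":
    rng = random.Random(0)
    boundaries = [0.002, 0.1, 1.0, 80.0]  # illustrative ranges for 3 blocks
    denoisers = [toy_denoiser] * 3
    print(train_step(denoisers, boundaries, [0.5, -0.3], rng))
```

Because no step needs the other blocks, the blocks could in principle be trained on separate devices or at separate times.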

Topics

DiffusionBlocks
Transformer Networks
Residual Networks
Score-based Diffusion Models
Block-wise Training
Memory Optimization

Code references

SakanaAI/DiffusionBlocks

Best for: Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning ML & Generative AI News.