Sakana AI Proposes DiffusionBlocks: a Block-wise Training Framework That Converts Residual Networks into Independently Trainable Denoising Modules
Summary
Sakana AI has introduced DiffusionBlocks, a block-wise training framework that converts residual networks into independently trainable denoising modules. It partitions a transformer-based network into B blocks that are trained independently, reducing training memory by a factor of B. The core idea is to read the residual connections in a transformer as Euler discretization steps of the probability flow ODE used in score-based diffusion models, which lets each block be trained on its own with a score-matching objective. Converting a residual network requires three modifications: partitioning the L layers into B blocks, assigning each block a noise range via equi-probability partitioning, and adding noise-level conditioning via AdaLN.
The framework was validated on five architectures. Reported results include ViT on CIFAR-100 (59.30% accuracy vs. 60.25% for the baseline) and DiT-L/2 on ImageNet 256 (FID 10.63 vs. 12.09 for the baseline, with 3x less memory). On CIFAR-10, equi-probability partitioning lowered FID from 43.53 to 38.03 relative to uniform partitioning (lower is better). Recurrent-depth models such as Huginn saw the largest gains, with total training compute dropping by approximately 10x against 32-iteration backpropagation through time (BPTT).
Key takeaway
For machine learning engineers training large transformer models, DiffusionBlocks offers a way to reduce training memory and compute. If GPU memory or long training times are the bottleneck, particularly for recurrent-depth architectures, this framework is worth evaluating. Blocks are trained independently, which is reported to cut training compute by roughly 10x for models like Huginn while leaving K-iteration inference unchanged. That could shorten development cycles and make larger models feasible on existing hardware.
Key insights
Residual connections in transformers can be read as Euler discretization steps of the probability flow ODE in score-based diffusion models, which makes it possible to train each block independently.
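The correspondence can be made concrete with a short worked equation. This is a sketch in EDM-style notation, which is an assumption here and not taken from the article: D_\theta is a denoiser, \sigma is the noise level, and f_l is the residual branch of layer l.

```latex
% Residual update of layer l:
\begin{equation}
  x_{l+1} = x_l + f_l(x_l)
\end{equation}

% Probability flow ODE, parameterized by the noise level \sigma
% (assumed EDM-style form with denoiser D_\theta):
\begin{equation}
  \frac{\mathrm{d}x}{\mathrm{d}\sigma} = \frac{x - D_\theta(x, \sigma)}{\sigma}
\end{equation}

% One Euler step from \sigma_l down to \sigma_{l+1} < \sigma_l:
\begin{equation}
  x_{l+1} = x_l
    + \underbrace{(\sigma_{l+1} - \sigma_l)\,
      \frac{x_l - D_\theta(x_l, \sigma_l)}{\sigma_l}}_{\text{plays the role of } f_l(x_l)}
\end{equation}
```

Both updates have the form "input plus a learned increment", so a block that covers a given range of noise levels can be trained as a denoiser for that range without backpropagating through the other blocks.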
Principles
- Residual connections map to probability flow ODEs.
- Equi-probability partitioning improves noise assignment.
- Independent block training reduces memory and compute.
Method
Partition L layers into B blocks, assign noise ranges via equi-probability partitioning, and add AdaLN for noise-level conditioning.
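The three modifications can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the noise level is assumed to follow an EDM-style log-normal training distribution (the defaults `p_mean=-1.2`, `p_std=1.2`, `sigma_min=0.002`, `sigma_max=80.0` are EDM's common settings, not values from the article), and the AdaLN modulation uses scalar scale and shift where a real model would predict per-channel vectors from a noise-level embedding.

```python
import math
from statistics import NormalDist


def partition_layers(num_layers, num_blocks):
    """Split layer indices 0..num_layers-1 into num_blocks contiguous groups."""
    base, extra = divmod(num_layers, num_blocks)
    blocks, start = [], 0
    for b in range(num_blocks):
        # Spread any remainder over the first `extra` blocks.
        size = base + (1 if b < extra else 0)
        blocks.append(list(range(start, start + size)))
        start += size
    return blocks


def equiprobability_boundaries(num_blocks, p_mean=-1.2, p_std=1.2,
                               sigma_min=0.002, sigma_max=80.0):
    """Return num_blocks + 1 noise-level edges with equal probability mass per block.

    Assumes ln(sigma) ~ Normal(p_mean, p_std), truncated to [sigma_min, sigma_max].
    """
    dist = NormalDist(mu=p_mean, sigma=p_std)
    lo = dist.cdf(math.log(sigma_min))
    hi = dist.cdf(math.log(sigma_max))
    # Evenly spaced quantiles between the truncation points.
    edges = [math.exp(dist.inv_cdf(lo + (hi - lo) * i / num_blocks))
             for i in range(num_blocks + 1)]
    # Pin the endpoints to remove floating-point round-trip error.
    edges[0], edges[-1] = sigma_min, sigma_max
    return edges


def adaln(x, scale, shift, eps=1e-6):
    """Layer-normalize x, then modulate it: norm(x) * (1 + scale) + shift.

    In a real model, scale and shift would be predicted from the noise level.
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) * (1.0 + scale) + shift for v in x]


if __name__ == "__main__":
    print(partition_layers(12, 3))          # three contiguous groups of four layers
    print(equiprobability_boundaries(3))    # four edges delimiting three noise ranges
```

Compared with uniform spacing in sigma, quantile-based edges give every block the same share of training samples under the assumed noise distribution, which is the intuition behind equi-probability partitioning.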
In practice
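- Apply DiffusionBlocks to ViT, DiT, masked diffusion, and autoregressive (AR) Transformer models.
- Train recurrent-depth models such as Huginn block-wise; the reported reduction in total training compute is roughly 10x against 32-iteration BPTT.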
Topics
- DiffusionBlocks
- Transformer Networks
- Residual Networks
- Score-based Diffusion Models
- Block-wise Training
- Memory Optimization
Best for: Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning ML & Generative AI News.