Show the Signal, Hide the Noise: Spectral Forcing for Pixel-Space Diffusion
Summary
Pixel-space diffusion models are trained on full-bandwidth noisy images, yet the useful signal in those images is strongly frequency dependent, which creates a capacity-allocation problem. The data-to-noise contour k*(t) = (1-t)^(-2/α) implicitly separates low-frequency signal from high-frequency noise, so a standard denoiser has to discover this moving bandwidth boundary internally. To address this, the authors introduce Spectral Forcing, a parameter-free, time-conditional 2D-DCT low-pass operator. It is applied to the noisy input before the patch embedder; its cutoff expands monotonically with diffusion time, and the operator becomes the identity at the data endpoint. Experiments on ImageNet-256 with JiT-700M/32 show that Spectral Forcing consistently improves FID and Inception Score, particularly with coarse patch tokenization. It also improves DPG-Bench and GenEval scores in the SenseNova-U1 text-to-image model, which indicates that the method transfers beyond class-conditional generation. Together, these results suggest a path to more capacity-efficient pixel-space diffusion.
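To get a feel for how quickly the boundary moves, the contour can be evaluated at a few diffusion times. The sketch below reads the exponent as -2/α and picks α = 2 purely for illustration; the function name `contour`, the value of α, and the sampled times are assumptions, not values taken from the paper.

```python
def contour(t: float, alpha: float) -> float:
    """Data-to-noise contour k*(t) = (1 - t)^(-2/alpha), as stated in the summary.

    t is the diffusion time in [0, 1), with t -> 1 the data endpoint.
    alpha is left as a free parameter here (assumption: any positive value).
    """
    if not 0.0 <= t < 1.0:
        raise ValueError("t must lie in [0, 1); the contour diverges at t = 1")
    if alpha <= 0.0:
        raise ValueError("alpha must be positive")
    return (1.0 - t) ** (-2.0 / alpha)


if __name__ == "__main__":
    alpha = 2.0  # illustrative choice only
    for t in (0.0, 0.5, 0.75, 0.9, 0.99):
        # Frequencies below k*(t) are treated as signal-dominated,
        # frequencies above it as noise-dominated.
        print(f"t = {t:.2f}  ->  k*(t) = {contour(t, alpha):.2f}")
```

The boundary grows without bound as t approaches 1, which matches the statement that the low-pass operator becomes the identity at the data endpoint.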
Key takeaway
Machine Learning Engineers developing pixel-space diffusion models should consider adding Spectral Forcing. This parameter-free 2D-DCT low-pass operator improves FID and Inception Score, especially with coarse patch tokenization, by explicitly separating frequency-dependent signal from noise before the denoiser sees the input. The reported gains also carry over to unified text-to-image architectures such as SenseNova-U1, so the technique is not limited to class-conditional generation.
Key insights
Spectral Forcing improves pixel-space diffusion by explicitly filtering out high-frequency noise, so the denoiser can spend its capacity on the frequency band that still carries signal.
Principles
- Useful signal in noisy images is frequency dependent.
- Explicitly managing frequency bands optimizes denoiser capacity.
- High-frequency content can be predominantly noise.
Method
Spectral Forcing applies a time-conditional 2D-DCT low-pass filter to the noisy input before the patch embedder. The cutoff expands monotonically with diffusion time until the filter becomes the identity at the data endpoint.
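A minimal NumPy sketch of such an operator is shown below. It is not the authors' implementation: the cutoff schedule (`k0 * (1 - t)^(-2/alpha)` clipped to the image size), the default values of `alpha` and `k0`, the square low-frequency mask, and all function names are assumptions chosen only to satisfy the properties described above (parameter-free, monotone in t, identity at t = 1).

```python
import numpy as np


def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II matrix C: C @ x is the DCT of x, C.T @ X inverts it."""
    k = np.arange(n)[:, None]  # frequency index (rows)
    i = np.arange(n)[None, :]  # sample index (columns)
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)  # DC row uses a different normalization
    return c


def cutoff(t: float, n: int, alpha: float = 2.0, k0: float = 4.0) -> int:
    """Hypothetical schedule: number of DCT frequencies kept per axis at time t.

    Grows monotonically with t and reaches n (no filtering) at t = 1.
    """
    if t >= 1.0:
        return n
    k = k0 * (1.0 - t) ** (-2.0 / alpha)
    return int(min(n, np.ceil(k)))


def spectral_lowpass(x: np.ndarray, t: float,
                     alpha: float = 2.0, k0: float = 4.0) -> np.ndarray:
    """Low-pass filter the last two axes (H, W) of x in the 2D-DCT domain."""
    h, w = x.shape[-2:]
    c_h, c_w = dct_matrix(h), dct_matrix(w)
    coeffs = c_h @ x @ c_w.T  # forward 2D DCT
    mask = np.zeros((h, w))
    # Assumption: keep a square block of the lowest frequencies.
    mask[:cutoff(t, h, alpha, k0), :cutoff(t, w, alpha, k0)] = 1.0
    return c_h.T @ (coeffs * mask) @ c_w  # inverse 2D DCT


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x_noisy = rng.standard_normal((3, 32, 32))  # stand-in for a noisy image
    for t in (0.0, 0.5, 1.0):
        y = spectral_lowpass(x_noisy, t)
        print(t, cutoff(t, 32), float(np.abs(y - x_noisy).max()))
```

In a model, the filtered tensor would be passed to the patch embedder in place of the raw noisy input. The operator has no learned parameters, so it adds nothing to the parameter count.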
In practice
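- Apply a time-conditional 2D-DCT low-pass filter to the noisy input before the patch embedder.
- Expect the largest gains with coarse patch tokenization.
- Consider it for text-to-image models as well; the SenseNova-U1 results suggest the benefit transfers.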
Topics
- Pixel-space Diffusion
- Spectral Forcing
- 2D-DCT Low-pass Filter
- ImageNet-256
- Text-to-Image Generation
- Model Capacity Optimization
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.