Show the Signal, Hide the Noise: Spectral Forcing for Pixel-Space Diffusion

Source: Computer Vision and Pattern Recognition · Field: Technology & Digital, Artificial Intelligence & Machine Learning, Computer Vision · Depth: Expert, quick

Summary

Pixel-space diffusion models are trained on full-bandwidth noisy images, yet they face a capacity-allocation problem because the useful signal is strongly frequency dependent. The data-to-noise contour k*(t) = (1-t)^(-2/α) implicitly separates low-frequency signal from high-frequency noise, so a standard denoiser must discover this moving bandwidth boundary internally. To address this, the authors introduce Spectral Forcing, a parameter-free, time-conditional 2D-DCT low-pass operator. It is applied to the noisy input before the patch embedder; its cutoff expands monotonically with diffusion time, and the operator becomes the identity at the data endpoint. Experiments on ImageNet-256 with JiT-700M/32 show that Spectral Forcing consistently improves FID and Inception Score, particularly with coarse patch tokenization. It also improved DPG-Bench and GenEval scores for the SenseNova-U1 text-to-image model, which indicates that the gains transfer beyond class-conditional generation. The method suggests a path to more capacity-efficient pixel-space diffusion.
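To get a feel for how the bandwidth boundary moves, the contour can be evaluated numerically. The snippet below is a minimal sketch, not the authors' code: it reads the formula as (1-t) raised to the power -2/α, and the function name `data_to_noise_contour` and the value α = 2 are illustrative assumptions, since the summary does not define α or state its value.

```python
import numpy as np

def data_to_noise_contour(t, alpha=2.0):
    """Evaluate k*(t) = (1 - t) ** (-2 / alpha) for t in [0, 1).

    alpha=2.0 is an illustrative assumption, not a value from the source.
    """
    t = np.asarray(t, dtype=np.float64)
    return (1.0 - t) ** (-2.0 / alpha)

# Sample the contour at a few diffusion times (t -> 1 is the data endpoint).
ts = np.array([0.0, 0.5, 0.9, 0.99])
ks = data_to_noise_contour(ts, alpha=2.0)
for t, k in zip(ts, ks):
    print(f"t={t:.2f}  k*={k:.2f}")
```

Under this reading the boundary starts at k* = 1 and grows without bound as t approaches the data endpoint, which matches the description of a cutoff that widens until the filter passes everything.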

Key takeaway

Machine learning engineers developing pixel-space diffusion models should consider integrating Spectral Forcing. This parameter-free 2D-DCT low-pass operator improves FID and Inception Score, especially with coarse patch tokenization, by explicitly separating frequency-dependent signal from noise. The reported gains also carry over to unified text-to-image architectures such as SenseNova-U1, which suggests the technique can make generative models more capacity-efficient.

Key insights

Spectral Forcing improves pixel-space diffusion by explicitly filtering out high-frequency noise, so model capacity goes to the signal rather than to locating the signal-noise boundary.

Principles

Method

Spectral Forcing applies a time-conditional 2D-DCT low-pass filter to the noisy input before the patch embedder, with a cutoff that expands monotonically with diffusion time until the filter becomes the identity at the data endpoint.
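The operator described above might be sketched as follows. This is a hedged illustration under stated assumptions, not the paper's implementation: the linear `cutoff_schedule`, the `k_min` parameter, and the square low-frequency mask are hypothetical choices made here for simplicity (the paper's schedule presumably follows the data-to-noise contour), and all function names are invented for this example. The sketch keeps the two properties the summary states: there are no learned parameters, and the filter is the identity at t = 1.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix C of shape (n, n), so C @ C.T == I."""
    k = np.arange(n)[:, None]   # frequency index
    i = np.arange(n)[None, :]   # sample index
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)  # DC row has its own normalization
    return c

def cutoff_schedule(t, n, k_min=4):
    """Hypothetical schedule: linear growth from k_min at t=0 to n at t=1."""
    k = int(round(k_min + (n - k_min) * float(t)))
    return max(1, min(n, k))

def spectral_forcing(x_t, t, k_min=4):
    """Low-pass x_t of shape (..., H, W) in the 2D-DCT domain.

    The cutoff depends only on t; nothing here is learned.
    """
    h, w = x_t.shape[-2:]
    ch, cw = dct_matrix(h), dct_matrix(w)
    coeffs = ch @ x_t @ cw.T                  # forward 2D DCT
    mask = np.zeros((h, w))
    # Assumption: keep a square block of the lowest frequencies.
    mask[:cutoff_schedule(t, h, k_min), :cutoff_schedule(t, w, k_min)] = 1.0
    return ch.T @ (coeffs * mask) @ cw        # inverse 2D DCT

# Usage: filter a noisy image before it would reach the patch embedder.
rng = np.random.default_rng(0)
x_noisy = rng.standard_normal((3, 32, 32))
x_early = spectral_forcing(x_noisy, t=0.1)    # narrow band, mostly noise removed
x_final = spectral_forcing(x_noisy, t=1.0)    # full band, input unchanged
```

Because the mask is fixed by t, the filter adds no trainable weights; it only changes which frequencies the patch embedder sees at each diffusion time.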

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.