SpectralDiT: Timestep-Conditioned Spectral Residual Correction for Flow-Matching DiTs

Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

SpectralDiT is a lightweight modification for flow-matching Diffusion Transformers (DiTs) that adds a timestep-conditioned spectral correction to the MLP residual branch. The module decomposes each residual update into low- and high-frequency components on the patch-token grid and applies the correction through a zero-initialized additive gate, so the model initially behaves like the baseline DiT. On CIFAR-10 pixel-space generation at patch size 1, SpectralDiT improved FID from 20.78 to 19.71 and narrowed the radial Fourier spectrum gap. Scaled to latent diffusion on ImageNet-100, the method added only 0.6% theoretical FLOPs and 1.36% parameters and achieved an 8.7% relative FID reduction under classifier-free guidance (CFG 2.0). All reported results are averaged over five seeds, and ablations show stable, block-specific spectral correction patterns.

Key takeaway

For machine learning engineers developing Diffusion Transformers, SpectralDiT offers a low-cost way to improve generative quality. If you want to lower FID in a flow-matching DiT, consider adding timestep-conditioned spectral correction to the MLP residual branches. The added cost is small (0.6% theoretical FLOPs and 1.36% parameters on ImageNet-100), and the reported gain there is an 8.7% relative FID reduction under CFG 2.0. The results cover CIFAR-10 and ImageNet-100 only, so evaluate the impact on your own datasets.

Key insights

SpectralDiT enhances Diffusion Transformers by applying timestep-conditioned spectral correction to residual updates, improving FID with minimal overhead.

Principles

Method

SpectralDiT decomposes each MLP residual update into low- and high-frequency components on the patch-token grid, then applies a learned, timestep-conditioned correction to those components through a zero-initialized additive gate, so training starts from the baseline DiT's behavior.
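
The method described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the FFT-based radial low-pass split, the `cutoff` value, the linear form of the gate, and all function and class names (`split_low_high`, `SpectralGate`, `corrected_residual`) are assumptions made for illustration. The only properties taken from the summary are that the residual is split into low- and high-frequency parts on the patch-token grid, that the correction depends on the timestep, and that the gate starts at zero so the block initially reproduces the baseline residual.

```python
import numpy as np


def radial_lowpass_mask(h, w, cutoff=0.25):
    """Boolean mask on the 2D FFT grid, True where radial frequency <= cutoff.

    Frequencies are in cycles per token; the cutoff value is an assumption.
    """
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    return np.sqrt(fy**2 + fx**2) <= cutoff


def split_low_high(residual, h, w, cutoff=0.25):
    """Split a (h*w, d) token residual into low/high parts on the h x w grid."""
    grid = residual.reshape(h, w, -1)
    spec = np.fft.fft2(grid, axes=(0, 1))
    mask = radial_lowpass_mask(h, w, cutoff)[:, :, None]
    low = np.fft.ifft2(spec * mask, axes=(0, 1)).real
    high = grid - low  # by construction, low + high == grid
    return low.reshape(h * w, -1), high.reshape(h * w, -1)


class SpectralGate:
    """Hypothetical timestep-conditioned gate producing (low, high) gains.

    Weights and bias are zero-initialized, so both gains are 0 before training.
    """

    def __init__(self, t_dim):
        self.W = np.zeros((t_dim, 2))
        self.b = np.zeros(2)

    def __call__(self, t_emb):
        return t_emb @ self.W + self.b


def corrected_residual(residual, t_emb, gate, h, w, cutoff=0.25):
    """Add the gated spectral correction to the MLP residual update."""
    low, high = split_low_high(residual, h, w, cutoff)
    g_low, g_high = gate(t_emb)
    return residual + g_low * low + g_high * high
```

With the gate at its zero initialization, `corrected_residual` returns the input residual unchanged, which is the sense in which the block starts out equivalent to the baseline. In a real DiT the gate would be a trainable layer fed by the timestep embedding, and the split would run batched on the accelerator.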

Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.