DiffusionBench: On Holistic Evaluation of Diffusion Transformers
Summary
DiffusionBench introduces a holistic evaluation framework for Diffusion Transformers (DiTs). It addresses a limitation of current research, which focuses primarily on class-conditional ImageNet generation. This narrow focus often fails to reflect true progress in generative modeling, especially for text-to-image (T2I) tasks, which are frequently overlooked because of their perceived training and evaluation costs. The authors present NanoGen, a unified DiT training and evaluation framework that matches state-of-the-art ImageNet baselines and can train competitive T2I models with just 12 lines of configuration change. NanoGen supports several diffusion methods (RAE, VAE, pixel-space, MeanFlow) across both ImageNet and T2I setups, and the authors use it to show that T2I training requires compute comparable to ImageNet training. After training 21 latent diffusion models, the authors report a critical finding: method rankings show no strong correlation between ImageNet and T2I generation, with Pearson correlations ranging from -0.377 to -0.580. Improvements on ImageNet therefore do not reliably translate to T2I, which underscores the need to evaluate DiTs on both tasks. The authors recommend DiffusionBench, which compiles both ImageNet and T2I results, for reporting broader progress.
Key takeaway
AI scientists and machine learning engineers developing Diffusion Transformers should broaden their evaluation beyond class-conditional ImageNet. Because ImageNet and text-to-image results show no strong correlation (Pearson -0.377 to -0.580), optimizing solely for one task risks poor results on the other. Adopt DiffusionBench, which integrates both ImageNet and T2I results, to assess models holistically and to check that advances reflect broader generative modeling progress. Use a framework such as NanoGen to train and evaluate T2I models efficiently, since the reported compute cost is comparable to that of ImageNet training.
Key insights
ImageNet-centric DiT evaluation is insufficient: ImageNet results do not strongly correlate with T2I performance, so holistic benchmarks such as DiffusionBench are needed.
Principles
- ImageNet FID improvements do not guarantee T2I progress.
- Holistic evaluation across diverse tasks is crucial for DiT research.
- T2I training compute is comparable to ImageNet.
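The first principle rests on a Pearson correlation between per-method ImageNet and T2I scores. A minimal sketch of that computation follows, assuming one score per method on each task. The score lists are hypothetical placeholders, not results from the paper, and the paper's exact metrics and procedure may differ.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical placeholder scores for four methods (NOT from the paper).
imagenet_fid = [2.1, 2.4, 3.0, 3.5]    # assumed metric, lower is better
t2i_score = [0.52, 0.61, 0.48, 0.57]   # assumed metric, higher is better

r = pearson(imagenet_fid, t2i_score)
print(f"Pearson r = {r:.3f}")
```

When one metric is lower-is-better and the other higher-is-better, the sign of r has to be read with that in mind, so it helps to state each metric's direction next to any reported correlation.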
Method
NanoGen provides a unified framework for DiT training and evaluation, supporting multiple diffusion methods and enabling competitive T2I model training with minimal configuration changes.
In practice
- Use NanoGen to train competitive text-to-image models.
- Evaluate DiTs on both ImageNet and text-to-image tasks.
- Report DiffusionBench results for comprehensive model assessment.
Topics
- Diffusion Transformers
- Text-to-Image Generation
- ImageNet
- Generative Models
- Model Evaluation
- DiffusionBench
- NanoGen
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.