The FID Lottery: Quantifying Hidden Randomness in Generative-Model Evaluation
Summary
A study of the Fréchet Inception Distance (FID), the standard metric for image generation, finds substantial hidden randomness in reported scores. The researchers treated FID as a random variable over training and generation seeds and measured its variance on hundreds of SiT networks trained on class-conditional ImageNet 256x256. Retraining a model with a different seed shifts FID 3.2x more than resampling from a fixed network. This variability stems from random initialization, data ordering, and Gaussian noise in the flow-matching loss. More compute or a larger model barely tightens the spread: the FID coefficient of variation (CoV) stays within a 1-2% band. Per-cell classifier-free-guidance tuning can halve this spread, yet a "lucky" training seed can still reach the same FID with up to 2x less compute than an "unlucky" one.
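For reference, the standard definitions of the two quantities the summary relies on are given below. Here $\mu_r, \Sigma_r$ and $\mu_g, \Sigma_g$ are the mean and covariance of Inception features for real and generated images, and $\sigma_{\mathrm{FID}}$ is the standard deviation of FID over seeds. The notation is ours and may differ from the paper's.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right),
\qquad
\mathrm{CoV} = \frac{\sigma_{\mathrm{FID}}}{\mathbb{E}[\mathrm{FID}]}
```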
Key takeaway
If you evaluate generative models, a single-run FID comparison may be misleading because of this hidden randomness. Adopt the study's protocol: evaluate under per-cell optimal guidance, treat any FID gap below ~1.3% CoV as inconclusive, and always report an error bar over several training seeds. Doing so makes benchmark comparisons more robust and reproducible and guards against misreading differences in model performance.
Key insights
FID scores are highly variable due to training randomness, making single reported numbers unreliable and requiring new evaluation protocols.
Principles
- Retraining a generative model with a new seed shifts FID 3.2x more than resampling from a fixed network.
- Increased compute or model size barely tightens FID score spread.
- Per-cell guidance tuning halves FID spread but reshuffles optimal seeds.
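The guidance principle can be illustrated with a minimal sketch. It assumes that a "cell" is one evaluated configuration (here, one training seed) and that each cell has a small sweep of FID against guidance scale. The summary does not define "cell", so this reading, the function name `spread_shared_vs_per_cell`, and all numbers below are hypothetical.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(values) / statistics.fmean(values)


def spread_shared_vs_per_cell(sweeps):
    """Compare FID spread under one shared guidance scale vs per-cell tuning.

    sweeps[cell][scale] is the FID measured for that cell at that scale.
    Assumes every cell was swept over the same set of scales.
    """
    cells = list(sweeps.values())
    scales = sorted(cells[0])
    # Shared setting: the single scale with the lowest mean FID over cells.
    shared = min(scales, key=lambda s: statistics.fmean([c[s] for c in cells]))
    fid_shared = [c[shared] for c in cells]
    # Per-cell setting: each cell keeps its own best scale.
    fid_tuned = [min(c.values()) for c in cells]
    return shared, coefficient_of_variation(fid_shared), coefficient_of_variation(fid_tuned)


# Made-up sweeps for illustration only (not data from the study).
sweeps = {
    "seed0": {1.3: 2.30, 1.5: 2.10, 1.7: 2.25},
    "seed1": {1.3: 2.12, 1.5: 2.20, 1.7: 2.40},
    "seed2": {1.3: 2.45, 1.5: 2.22, 1.7: 2.11},
}
shared_scale, cov_shared, cov_tuned = spread_shared_vs_per_cell(sweeps)
print(f"shared scale {shared_scale}: CoV {cov_shared:.2%}; per-cell tuned: CoV {cov_tuned:.2%}")
```

In this toy sweep each seed prefers a different scale, so tuning per cell narrows the spread. How much it narrows on real models is an empirical question that the sketch does not answer.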
Method
Researchers treated FID as a random variable on a two-axis panel of training and generation seeds, directly measuring its variance across hundreds of SiT networks on ImageNet 256x256.
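A two-axis panel of this kind can be analyzed with a simple one-way random-effects decomposition, sketched below. This is an assumption about how such a panel might be processed, not necessarily the estimator the paper uses. The function name `decompose_fid_variance` and the synthetic numbers are hypothetical.

```python
import random
import statistics


def decompose_fid_variance(panel):
    """Split FID variance into training-seed and generation-seed parts.

    panel[i][j] is the FID of the network trained with seed i,
    evaluated on samples drawn with generation seed j.
    """
    n_gen = len(panel[0])
    row_means = [statistics.fmean(row) for row in panel]
    # Resampling variance: average within-network sample variance.
    var_gen = statistics.fmean([statistics.variance(row) for row in panel])
    # Retraining variance: variance of per-network means, minus the
    # resampling noise that leaks into each mean (clipped at zero).
    var_train = max(statistics.variance(row_means) - var_gen / n_gen, 0.0)
    grand_mean = statistics.fmean(row_means)
    return {
        "mean": grand_mean,
        "sd_train": var_train ** 0.5,
        "sd_gen": var_gen ** 0.5,
        "cov_train": var_train ** 0.5 / grand_mean,
    }


# Synthetic panel for illustration only (not data from the study):
# 40 training seeds x 8 generation seeds around a made-up FID of 2.5.
rng = random.Random(0)
panel = []
for _ in range(40):
    network_fid = 2.5 + rng.gauss(0.0, 0.05)  # made-up retraining effect
    panel.append([network_fid + rng.gauss(0.0, 0.015) for _ in range(8)])

result = decompose_fid_variance(panel)
print({key: round(value, 4) for key, value in result.items()})
```

The ratio `sd_train / sd_gen` is the kind of quantity behind a "retraining versus resampling" comparison. Subtracting `var_gen / n_gen` keeps resampling noise from being counted as a training-seed effect.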
In practice
- Evaluate FID under per-cell optimal guidance.
- Consider FID gaps below ~1.3% CoV inconclusive.
- Report FID with error bars from multiple training seeds.
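The reporting steps above can be sketched as a small helper. It assumes that "a gap below ~1.3% CoV" means the relative difference between two models' mean FIDs is under 0.013. That is our reading of the summary, not a confirmed definition, and the names `summarize_fid` and `compare_models` and the FID values are hypothetical.

```python
import statistics


def summarize_fid(fids):
    """Return (mean, sample standard deviation, CoV) over training seeds."""
    mean = statistics.fmean(fids)
    sd = statistics.stdev(fids)
    return mean, sd, sd / mean


def compare_models(fids_a, fids_b, threshold=0.013):
    """Flag a FID gap as inconclusive when its relative size is below threshold."""
    mean_a, mean_b = statistics.fmean(fids_a), statistics.fmean(fids_b)
    relative_gap = abs(mean_a - mean_b) / ((mean_a + mean_b) / 2)
    verdict = "inconclusive" if relative_gap < threshold else "above threshold"
    return relative_gap, verdict


# Hypothetical per-training-seed FIDs, each assumed measured at tuned guidance.
model_a = [2.00, 2.02, 1.98]
model_b = [2.01, 2.03, 1.99]
for name, fids in (("A", model_a), ("B", model_b)):
    mean, sd, cov = summarize_fid(fids)
    print(f"model {name}: FID {mean:.3f} +/- {sd:.3f} (CoV {cov:.2%}, n={len(fids)})")
print(compare_models(model_a, model_b))
```

A gap that clears the threshold is labeled "above threshold", not "significant": the threshold is a screening rule, and the error bars over training seeds still need to be read alongside it.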
Topics
- Fréchet Inception Distance
- Generative Models
- Model Evaluation
- Reproducibility
- Classifier-Free Guidance
- ImageNet
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.