The FID Lottery: Quantifying Hidden Randomness in Generative-Model Evaluation

2026-06-18 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision · Depth: Expert, quick

Summary

A study on the Frechet Inception Distance (FID), the standard metric for image generation, reveals significant hidden randomness affecting reported scores. Researchers treated FID as a random variable across training and generation seeds, measuring its variance on hundreds of SiT networks trained on class-conditional ImageNet 256x256. Findings indicate that retraining a model with a different seed shifts FID 3.2x more than simply resampling from a fixed network. This variability stems from random initialization, data ordering, and Gaussian noise in the flow-matching loss. Crucially, increasing compute or model size offers minimal improvement, with the FID coefficient of variation (CoV) remaining within a 1-2% band. Per-cell classifier-free-guidance tuning can halve this spread, but a "lucky" training seed can achieve the same FID with up to 2x less compute than an "unlucky" one.

Key takeaway

For machine learning engineers evaluating generative models, a single reported FID can mislead because it hides seed-to-seed randomness. Adopt the protocol the study proposes: evaluate under per-cell optimal guidance, treat any FID gap below ~1.3% CoV as inconclusive, and always report an error bar over several training seeds. This makes benchmark comparisons more robust and reproducible and reduces the risk of misreading differences in model performance.

Key insights

FID varies materially with the training seed, so a single reported number is unreliable; comparisons need error bars over several training seeds and a noise threshold for small gaps.

Principles

Retraining with a different seed shifts FID 3.2x more than resampling from a fixed network.
Increasing compute or model size barely tightens the spread: the FID CoV stays within a 1-2% band.
Per-cell classifier-free-guidance tuning halves the FID spread but reshuffles which training seeds come out best.

Method

Researchers treated FID as a random variable on a two-axis panel of training and generation seeds, directly measuring its variance across hundreds of SiT networks on ImageNet 256x256.
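
A minimal sketch of how such a two-axis panel could be summarized, assuming FID has already been computed for every (training seed, generation seed) pair. The function name seed_spread, the panel layout, and the toy numbers are illustrative assumptions, not the study's code or data, and the estimator is a simple one: the spread of per-network means still contains a small share of generation noise.

```python
from statistics import mean, stdev

def seed_spread(panel):
    """Summarize a panel where panel[i][j] is the FID of the network
    trained with seed i, sampled with generation seed j.
    (Hypothetical helper; layout is an assumption.)"""
    row_means = [mean(row) for row in panel]
    # Resampling spread: std across generation seeds, averaged over networks.
    gen_std = mean([stdev(row) for row in panel])
    # Retraining spread: std across training seeds of each network's mean FID.
    train_std = stdev(row_means)
    grand_mean = mean(row_means)
    return {
        "train_std": train_std,
        "gen_std": gen_std,
        "ratio": train_std / gen_std,                    # retrain vs. resample
        "cov_percent": 100.0 * train_std / grand_mean,   # CoV over training seeds
    }

# Toy values, made up for illustration only (3 training x 3 generation seeds).
toy_panel = [
    [2.00, 2.02, 1.98],
    [2.10, 2.12, 2.08],
    [1.90, 1.92, 1.88],
]
print(seed_spread(toy_panel))
```

The ratio field is the quantity the 3.2x figure refers to in spirit; whether the study measures it with standard deviations or another statistic is not stated here.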

In practice

Evaluate FID under per-cell optimal guidance.
Consider FID gaps below ~1.3% CoV inconclusive.
Report FID with error bars from multiple training seeds.
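
The checklist above can be sketched as a small comparison routine. This assumes each input list holds one FID per training seed, already evaluated at that configuration's tuned guidance scale. The function names, the reading of the ~1.3% rule as a relative gap between mean FIDs, and the toy numbers are all assumptions made for illustration.

```python
from statistics import mean, stdev

COV_THRESHOLD = 0.013  # the ~1.3% figure quoted above; approximate

def summarize(fids):
    """Return (mean, std, CoV) of FID over training seeds."""
    m = mean(fids)
    s = stdev(fids)
    return m, s, s / m

def compare(fids_a, fids_b, threshold=COV_THRESHOLD):
    """Report both models with error bars and flag gaps below the threshold.
    (Hypothetical helper; the decision rule is one possible reading.)"""
    mean_a, std_a, _ = summarize(fids_a)
    mean_b, std_b, _ = summarize(fids_b)
    # Relative gap, measured against the average of the two mean FIDs.
    rel_gap = abs(mean_a - mean_b) / ((mean_a + mean_b) / 2)
    verdict = "inconclusive" if rel_gap < threshold else "above noise threshold"
    return {
        "a": f"{mean_a:.3f} +/- {std_a:.3f}",
        "b": f"{mean_b:.3f} +/- {std_b:.3f}",
        "rel_gap": rel_gap,
        "verdict": verdict,
    }

# Toy per-seed FIDs, made up for illustration only.
model_a = [2.00, 2.02, 1.98]
model_b = [2.01, 2.03, 1.99]
print(compare(model_a, model_b))
```

A gap that clears the threshold is only a first filter; the per-seed error bars should still be reported alongside it.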

Topics

Frechet Inception Distance
Generative Models
Model Evaluation
Reproducibility
Classifier-Free Guidance
ImageNet

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.