DiffuJudge-AV: A Diffusion-Inspired Framework for Calibrated AV Video Evaluation
Summary
DiffuJudge-AV is a framework that addresses unreliable LLM/VLM judges in autonomous driving video evaluation. It shows that a text-only judge such as Claude can reach a Pearson correlation of 0.753 while its quadratic-weighted Cohen's κ falls as low as 0.057 because of score compression, which makes it unsuitable for safety-critical decisions. DiffuJudge-AV treats judge scores as noisy observations: it runs the judge under 7 perturbation levels (e.g., position bias, rubric paraphrase), denoises the resulting scores with a one-step Tweedie posterior mean, and reports calibrated uncertainty. Tested across 28,400 evaluations on Wayve's LingoQA benchmark, the open-source Qwen2.5-VL-7B model outperformed larger closed models, with Pearson r = 0.857, Spearman ρ = 0.856, Cohen's κ = 0.837, MAE = 0.57, and fail-detection F1 = 0.712. The study also found that vision grounding widened Claude's scoring range and raised its fail-detection recall from 0.02 to 0.94.
Key takeaway
For MLOps engineers deploying LLM/VLM judges in autonomous driving, relying on Pearson correlation alone is risky. Prioritize quadratic-weighted Cohen's κ and fail-detection F1, as DiffuJudge-AV does, to check that judges are calibrated and reliably flag safety-critical failures. Add vision grounding for visual tasks, audit judge biases with perturbation cascades, and route uncertain cases to human review so that failures in the evaluation infrastructure do not go unnoticed.
Key insights
LLM/VLM judges for AV evaluation require calibrated uncertainty to prevent misleading safety decisions.
Principles
- Treat judge scores as noisy sensor readings.
- Pearson correlation can hide critical decision boundary failures.
- Vision grounding improves judge reliability for visual tasks.
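To illustrate how correlation can hide a compressed score range, here is a minimal sketch in plain Python. The 1 to 5 scale and the toy `human` and `judge` lists are assumptions made up for illustration, not data from the article. The judge ranks items in roughly the right order but only ever outputs 4 or 5.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def quadratic_weighted_kappa(a, b, min_score, max_score):
    """Cohen's kappa with quadratic disagreement weights on an integer scale."""
    k = max_score - min_score + 1
    n = len(a)
    # Observed confusion matrix: rows follow rater a, columns follow rater b.
    observed = [[0.0] * k for _ in range(k)]
    for i, j in zip(a, b):
        observed[i - min_score][j - min_score] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[r][c] for r in range(k)) for c in range(k)]
    num = den = 0.0
    for r in range(k):
        for c in range(k):
            w = (r - c) ** 2 / (k - 1) ** 2           # quadratic penalty
            num += w * observed[r][c]                  # observed disagreement
            den += w * hist_a[r] * hist_b[c] / n       # disagreement expected by chance
    return 1.0 - num / den

# Hypothetical toy data (assumption): the judge compresses everything into 4..5.
human = [1, 1, 1, 2, 2, 3, 4, 5, 5, 5]
judge = [4, 4, 4, 4, 4, 4, 5, 5, 5, 5]

r = pearson(human, judge)
kappa = quadratic_weighted_kappa(human, judge, 1, 5)
print(f"Pearson r = {r:.3f}, quadratic-weighted kappa = {kappa:.3f}")
```

On this toy data the correlation is above 0.9 while κ stays below 0.3, because the judge never assigns a score low enough to mark a failure.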
Method
DiffuJudge-AV applies 7 perturbation levels to judge prompts and collects ~22 scores per item. It then denoises these scores with a one-step Tweedie posterior mean, quantifies their uncertainty, and wraps the result in a split-conformal interval.
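The denoising and interval steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the article's implementation: it assumes a Gaussian prior over true scores (`prior_mean` and `prior_var` are hypothetical parameters), estimates the noise level from the spread of the perturbed scores, and uses absolute residuals on a hypothetical calibration set for the split-conformal half-width. All data values are made up.

```python
from math import ceil
from statistics import mean, pvariance

def tweedie_denoise(scores, prior_mean, prior_var):
    """One-step Tweedie posterior mean, assuming a Gaussian prior (assumption)."""
    y = mean(scores)
    # Noise variance of the averaged observation: per-score variance / count.
    noise_var = pvariance(scores) / len(scores)
    # Score function of the Gaussian marginal N(prior_mean, prior_var + noise_var).
    marginal_score = -(y - prior_mean) / (prior_var + noise_var)
    # Tweedie's formula: E[x | y] = y + noise_var * d/dy log p(y).
    return y + noise_var * marginal_score

def split_conformal_halfwidth(calib_pred, calib_truth, alpha=0.1):
    """Half-width q such that [pred - q, pred + q] targets 1 - alpha coverage."""
    residuals = sorted(abs(p - t) for p, t in zip(calib_pred, calib_truth))
    n = len(residuals)
    rank = ceil((n + 1) * (1 - alpha))
    # With too few calibration points the guarantee needs an unbounded interval.
    return residuals[rank - 1] if rank <= n else float("inf")

# Hypothetical perturbed judge scores for one item (assumption).
scores = [3, 4, 4, 5, 4, 3, 4, 4]
denoised = tweedie_denoise(scores, prior_mean=3.0, prior_var=1.0)

# Hypothetical calibration set: denoised predictions vs. human labels (assumption).
calib_pred = [3.8, 2.1, 4.5, 1.2, 3.0, 4.9, 2.6, 3.3, 1.9]
calib_truth = [4, 2, 5, 1, 3, 5, 2, 4, 2]
half = split_conformal_halfwidth(calib_pred, calib_truth, alpha=0.2)

print(f"denoised = {denoised:.3f}, interval = [{denoised - half:.3f}, {denoised + half:.3f}]")
```

Under the Gaussian assumption, the Tweedie step shrinks the raw mean toward the prior in proportion to how noisy the perturbed scores are, so an item whose score swings under perturbation is pulled further than a stable one.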
In practice
- Use Cohen's κ and Fail-detection F1 for safety-critical eval.
- Incorporate vision frames for visual VLM judging.
- Audit judge bias with perturbation cascades.
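The practices above can be sketched in a few lines of Python. The fail threshold (a score of 2 or lower on a 1 to 5 scale), the routing rule, and the toy score lists are all assumptions for illustration; the article does not specify them.

```python
def fail_detection_f1(human, judge, fail_threshold=2):
    """Precision, recall, and F1 for detecting items the human marked as failing."""
    pairs = list(zip(human, judge))
    tp = sum(h <= fail_threshold and j <= fail_threshold for h, j in pairs)
    fp = sum(h > fail_threshold and j <= fail_threshold for h, j in pairs)
    fn = sum(h <= fail_threshold and j > fail_threshold for h, j in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def route(denoised, half_width, fail_threshold=2):
    """Send an item to human review when its interval straddles the fail boundary."""
    lower, upper = denoised - half_width, denoised + half_width
    if upper <= fail_threshold:
        return "fail"
    if lower > fail_threshold:
        return "pass"
    return "human_review"

# Hypothetical toy scores (assumption).
human = [1, 2, 2, 1, 4, 5, 3, 5]
judge = [2, 3, 1, 4, 4, 5, 3, 2]
precision, recall, f1 = fail_detection_f1(human, judge)
print(f"precision = {precision:.3f}, recall = {recall:.3f}, F1 = {f1:.3f}")

# Three hypothetical (denoised score, interval half-width) pairs.
decisions = [route(1.2, 0.5), route(2.3, 0.6), route(4.1, 0.6)]
print(decisions)
```

Routing on the interval rather than the point score means an item is only auto-passed or auto-failed when the whole interval sits on one side of the decision boundary.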
Topics
- Autonomous Driving Evaluation
- LLM-as-a-Judge
- VLM-as-a-Judge
- Diffusion Models
- Uncertainty Quantification
- Cohen's Kappa
- LingoQA Benchmark
Best for: AI Scientist, Research Scientist, Machine Learning Engineer, MLOps Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Towards Data Science.