DiffuJudge-AV: A Diffusion-Inspired Framework for Calibrated AV Video Evaluation

2026-05-28 · Source: Towards Data Science · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Data Science & Analytics · Depth: Expert, long

Summary

DiffuJudge-AV is a framework for making LLM/VLM judges reliable enough for autonomous driving video evaluation. It shows that a text-only judge such as Claude can reach a Pearson correlation of 0.753 while its quadratic-weighted Cohen's κ falls as low as 0.057: the judge compresses its scores into a narrow band, so it tracks the trend but misses the decision boundaries, which makes it unsuitable for safety-critical decisions. DiffuJudge-AV treats judge scores as noisy observations. It exposes each judge to 7 perturbation levels (e.g., position bias, rubric paraphrase), denoises the resulting scores with a one-step Tweedie posterior mean, and reports calibrated uncertainty. Across 28,400 evaluations on Wayve's LingoQA benchmark, the open-source Qwen2.5-VL-7B model outperformed larger closed models, with Pearson r = 0.857, Spearman ρ = 0.856, Cohen's κ = 0.837, MAE = 0.57, and Fail-detection F1 = 0.712. The study also found that vision grounding widened Claude's scoring range and raised its fail-detection recall from 0.02 to 0.94.

Key takeaway

For MLOps engineers deploying LLM/VLM judges in autonomous driving, relying on Pearson correlation alone is risky: a judge can correlate well and still fail to flag failures. Prioritize quadratic-weighted Cohen's κ and Fail-detection F1, as DiffuJudge-AV does, to check that judges are calibrated and reliably flag safety-critical failures. Add vision grounding for visual tasks, audit judge biases with perturbation cascades, and route uncertain cases to human review so that the evaluation infrastructure does not fail silently.

Key insights

LLM/VLM judges for AV evaluation require calibrated uncertainty to prevent misleading safety decisions.

Principles

Treat judge scores as noisy sensor readings.
Pearson correlation can hide critical decision boundary failures.
Vision grounding improves judge reliability for visual tasks.
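
To make the second principle concrete, here is a minimal sketch of how score compression can keep Pearson correlation high while quadratic-weighted Cohen's κ drops. The 1 to 5 scale and the toy scores are assumptions chosen for illustration, not data from the article, and the functions are plain reference implementations rather than the repository's code.

```python
from collections import Counter
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def quadratic_weighted_kappa(a, b, min_score, max_score):
    """Cohen's kappa with quadratic disagreement weights on an integer scale."""
    k = max_score - min_score + 1
    n = len(a)
    observed = Counter(zip(a, b))          # joint counts of (rater a, rater b)
    hist_a, hist_b = Counter(a), Counter(b)
    num = den = 0.0
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            w = (i - j) ** 2 / (k - 1) ** 2            # quadratic penalty
            num += w * observed[(i, j)] / n             # observed disagreement
            den += w * hist_a[i] * hist_b[j] / n ** 2   # chance disagreement
    return 1.0 - num / den

# Toy data (assumed): human labels use the full 1-5 scale,
# the judge only ever answers 3 or 4.
human = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
judge = [3, 3, 3, 3, 3, 4, 4, 4, 4, 4]

r = pearson(human, judge)
kappa = quadratic_weighted_kappa(human, judge, 1, 5)
print(f"Pearson r = {r:.3f}, quadratic-weighted kappa = {kappa:.3f}")
```

On this toy data Pearson r is about 0.85 while κ is 0.48. The judge orders items correctly but never emits a 1, 2, or 5, which is the kind of boundary failure a correlation metric cannot see.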

Method

DiffuJudge-AV applies 7 perturbation levels to judge prompts, collects ~22 scores per item, then uses a one-step Tweedie posterior mean for denoising and uncertainty quantification, wrapped in a split-conformal interval.
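
The denoising and interval steps can be sketched as follows. This is a simplified illustration, not the author's implementation: it assumes Gaussian noise on the per-item mean and a Gaussian prior over true scores (so Tweedie's formula reduces to a shrinkage estimate), and the names `tweedie_denoise`, `conformal_halfwidth`, `prior_mean`, and `prior_var` are hypothetical. All numbers are made up.

```python
import math
from statistics import mean, variance

def tweedie_denoise(scores, prior_mean, prior_var):
    """One-step Tweedie posterior mean: E[x | y] = y + sigma^2 * d/dy log p(y).

    Assumption: y (the mean of the perturbed scores) is the true score plus
    Gaussian noise, and true scores follow N(prior_mean, prior_var) with
    prior_var > 0, so p(y) is Gaussian and its score function is closed-form.
    """
    y = mean(scores)
    noise_var = variance(scores) / len(scores)      # variance of the item mean
    marginal_var = prior_var + noise_var            # variance of p(y)
    score_fn = -(y - prior_mean) / marginal_var     # d/dy log p(y)
    return y + noise_var * score_fn

def conformal_halfwidth(calib_pred, calib_truth, alpha=0.1):
    """Split-conformal half-width from absolute residuals on a calibration set."""
    residuals = sorted(abs(p - t) for p, t in zip(calib_pred, calib_truth))
    n = len(residuals)
    rank = math.ceil((n + 1) * (1 - alpha))         # finite-sample correction
    if rank > n:
        return math.inf                             # too few calibration points
    return residuals[rank - 1]

# One item scored under several perturbations (toy values; the article
# collects about 22 scores per item).
perturbed_scores = [2, 4, 3, 3]
denoised = tweedie_denoise(perturbed_scores, prior_mean=2.0, prior_var=0.5)

# Held-out calibration items: denoised judge scores vs. human labels (toy values).
calib_pred = [1.2, 2.5, 3.1, 4.4, 4.9, 2.0, 3.3, 1.6, 4.0]
calib_truth = [1, 2, 3, 4, 5, 2, 3, 2, 4]
halfwidth = conformal_halfwidth(calib_pred, calib_truth, alpha=0.1)

interval = (denoised - halfwidth, denoised + halfwidth)
print(f"denoised = {denoised:.2f}, interval = [{interval[0]:.2f}, {interval[1]:.2f}]")
```

The noisier an item's perturbed scores, the harder the estimate is pulled toward the prior mean, and the conformal half-width turns the point estimate into an interval whose coverage rests on exchangeability between calibration and test items rather than on the Gaussian assumption.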

In practice

Use Cohen's κ and Fail-detection F1 for safety-critical eval.
Incorporate vision frames for visual VLM judging.
Audit judge bias with perturbation cascades.
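
A minimal sketch of the first practice, plus the routing step mentioned in the key takeaway. The fail threshold (a score of 2 or below counts as a fail), the routing rule, and the toy scores are assumptions for illustration; the article's exact definitions may differ.

```python
def fail_detection_prf(human, judge, fail_threshold=2):
    """Precision, recall, and F1 for detecting fails (score <= fail_threshold)."""
    pairs = list(zip(human, judge))
    tp = sum(h <= fail_threshold and j <= fail_threshold for h, j in pairs)
    fp = sum(h > fail_threshold and j <= fail_threshold for h, j in pairs)
    fn = sum(h <= fail_threshold and j > fail_threshold for h, j in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def route(interval, fail_threshold=2):
    """Hypothetical routing rule: auto-decide only when the whole interval
    sits on one side of the fail boundary, otherwise ask a human."""
    lo, hi = interval
    if hi <= fail_threshold:
        return "fail"
    if lo > fail_threshold:
        return "pass"
    return "human_review"

human = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
compressed_judge = [3, 3, 3, 3, 3, 4, 4, 4, 4, 4]   # never emits a fail score
spread_judge = [1, 1, 2, 3, 2, 3, 4, 4, 5, 5]       # uses the full scale

print(fail_detection_prf(human, compressed_judge))  # recall collapses to 0
print(fail_detection_prf(human, spread_judge))
print(route((1.8, 2.6)))                            # straddles the boundary
```

A judge that never emits a failing score gets zero recall no matter how well it ranks items, which is why Fail-detection F1 belongs next to κ in a safety-critical evaluation. Routing on the interval rather than the point estimate keeps borderline items out of automatic decisions.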

Topics

Autonomous Driving Evaluation
LLM-as-a-Judge
VLM-as-a-Judge
Diffusion Models
Uncertainty Quantification
Cohen's Kappa
LingoQA Benchmark

Code references

syedhumarahim/diffujudge-av

Best for: AI Scientist, Research Scientist, Machine Learning Engineer, MLOps Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Towards Data Science.