When the Judge Is Wrong

2026-01-11 · Source: Agus’s Substack · Field: Technology & Digital — Artificial Intelligence & Machine Learning, FinTech & Digital Financial Services, Data Science & Analytics · Depth: Advanced, long

Summary

A controlled experiment using FinStructBench, a benchmark with graph-verified ground truth, shows that LLM-as-judge evaluation is unreliable for structured financial tasks. Researchers measured the False Acceptance Rate (FAR), the share of wrong answers the judge approves, for Claude Opus 4.6 acting as a judge in four configurations. Even in the best case, with ground truth provided and strict exact-match instructions, the judge approved 7.1% of wrong answers. Given only the source document (a realistic RAG-style deployment), the FAR rose to 31.7%, so nearly one in three wrong answers was approved. With neither ground truth nor source, the FAR reached 40.4%. The study finds that LLM judges, which rely on semantic similarity, struggle with numeric precision, plausible but incorrect answers, and partially complete answers. The problem is most acute for the threshold and exact-recall questions that matter in financial regulation.
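
As a rough illustration of the metric, FAR can be computed as the number of wrong answers the judge approved divided by the total number of wrong answers. The sketch below assumes a hypothetical `JudgedAnswer` record; the field names and the toy data are illustrative and are not taken from FinStructBench or the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgedAnswer:
    """One evaluated answer (hypothetical record layout, not from the study)."""
    is_correct: bool      # correctness according to verified ground truth
    judge_approved: bool  # verdict returned by the LLM judge


def false_acceptance_rate(records):
    """Return the fraction of wrong answers that the judge approved."""
    wrong = [r for r in records if not r.is_correct]
    if not wrong:
        raise ValueError("FAR is undefined without at least one wrong answer")
    accepted = sum(1 for r in wrong if r.judge_approved)
    return accepted / len(wrong)


# Toy data: four wrong answers, one of which the judge approved.
toy_records = [
    JudgedAnswer(is_correct=False, judge_approved=True),
    JudgedAnswer(is_correct=False, judge_approved=False),
    JudgedAnswer(is_correct=False, judge_approved=False),
    JudgedAnswer(is_correct=False, judge_approved=False),
    JudgedAnswer(is_correct=True, judge_approved=True),  # ignored by FAR
]
toy_far = false_acceptance_rate(toy_records)  # 1 / 4 = 0.25
```

Correct answers are excluded from the denominator, so FAR measures only how often the judge lets errors through.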

Key takeaway

For CTOs and VPs of Engineering evaluating AI systems for regulated financial services, relying solely on an LLM-as-judge for structured tasks such as compliance checks or numerical comparisons introduces unacceptable risk. Teams should adopt a tiered verification architecture: reserve LLM judges for subjective or analytical tasks, and use deterministic, graph-verified evaluation for any task where correctness is objectively provable. This helps meet effective challenge standards and keeps false acceptances from propagating through agentic pipelines.

Key insights

LLM-as-judge is unreliable for structured tasks, even with ground truth or source documents.

Principles

Semantic similarity is insufficient for structured verification.
LLM judges share failure modes with LLM answer generators.
Deterministic tasks require deterministic verification.
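
To make the gap between semantic similarity and structured verification concrete, here is a minimal sketch of a deterministic numeric check. It assumes the expected value is a single number and that an answer should contain exactly one number. The function names and the regular expression are illustrative assumptions, not part of FinStructBench.

```python
import re
from decimal import Decimal

# Matches integers and decimals, with optional thousands separators.
_NUMBER = re.compile(r"-?\d+(?:,\d{3})*(?:\.\d+)?")


def extract_numbers(text):
    """Return every number found in the text as an exact Decimal."""
    return [Decimal(match.replace(",", "")) for match in _NUMBER.findall(text)]


def exact_numeric_match(answer, expected):
    """Accept only if the answer contains exactly one number equal to expected.

    Deliberately strict (an assumption of this sketch): answers that mention
    more than one number are rejected rather than guessed at.
    """
    numbers = extract_numbers(answer)
    return len(numbers) == 1 and numbers[0] == expected
```

Under this check, "The threshold is 10.0%" fails against an expected value of 10.5 even though the two sentences read as near-identical, which is the kind of error a similarity-based judge may accept.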

Method

The study used FinStructBench to generate graph-verified ground truth for financial documents, then evaluated Claude Opus 4.6 as a judge across four configurations: strict with ground truth, lenient with ground truth, blind (neither ground truth nor source), and grounded (source document only).

In practice

Implement graph-verified evaluation for deterministic tasks.
Use LLM judgment for analytical tasks with human review.
Invest in clean, consistently defined data semantics.
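
The tiered approach in the list above could be sketched as a small router. This is an assumption-laden illustration: `TaskKind`, `Verdict`, and `evaluate` are hypothetical names, the exact string comparison stands in for graph-verified evaluation, and `llm_judge` is any callable the team supplies.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class TaskKind(Enum):
    DETERMINISTIC = "deterministic"  # correctness is objectively provable
    ANALYTICAL = "analytical"        # subjective or open-ended


@dataclass(frozen=True)
class Verdict:
    accepted: bool
    method: str
    needs_human_review: bool


def evaluate(
    task_kind: TaskKind,
    answer: str,
    ground_truth: Optional[str],
    llm_judge: Callable[[str], bool],
) -> Verdict:
    """Route an answer to a deterministic check or to an LLM judge."""
    if task_kind is TaskKind.DETERMINISTIC:
        if ground_truth is None:
            raise ValueError("deterministic tasks require verified ground truth")
        # Stand-in for graph-verified comparison: exact match after trimming.
        accepted = answer.strip() == ground_truth.strip()
        return Verdict(accepted, method="deterministic", needs_human_review=False)
    # Analytical tasks: the judge's verdict is provisional until a human reviews it.
    return Verdict(llm_judge(answer), method="llm_judge", needs_human_review=True)
```

The design choice is that the LLM judge is never consulted for deterministic tasks, so its false acceptances cannot reach outputs that can be checked exactly.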

Topics

LLM-as-Judge
Graph-Verified Ground Truth
Financial Documents
False Acceptance Rate
FinStructBench

Code references

asudjianto-xml/finstructbench

Best for: CTO, VP of Engineering/Data, Executive, AI Scientist, MLOps Engineer, Director of AI/ML

Editorial summary, takeaway, and curation by AIssential. Original article published by Agus’s Substack.