When the Judge Is Wrong
Summary
A controlled experiment using FinStructBench, a benchmark with graph-verified ground truth, shows that LLM-as-judge evaluation is unreliable for structured financial tasks. Researchers measured the False Acceptance Rate (FAR), the share of wrong answers the judge approved, with Claude Opus 4.6 acting as judge in four configurations. Even in the best case, with ground truth provided and strict exact-match instructions, the judge approved 7.1% of wrong answers. Given only the source document (a realistic RAG-style deployment), the FAR rose to 31.7%, so nearly one in three wrong answers was approved. With neither ground truth nor source, the FAR reached 40.4%. The study attributes this to LLM judges relying on semantic similarity: they struggle with numeric precision, plausible but incorrect answers, and partially complete answers, particularly on threshold and exact-recall questions, which are critical in financial regulation.
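As a rough illustration of the metric, FAR can be computed as the fraction of known-wrong answers that the judge approved. The sketch below assumes a hypothetical record format (`is_correct`, `judge_approved`); these names are illustrative and do not come from the study.

```python
from dataclasses import dataclass


@dataclass
class JudgedAnswer:
    # Hypothetical fields, not taken from FinStructBench.
    is_correct: bool      # answer checked against verified ground truth
    judge_approved: bool  # verdict returned by the LLM judge


def false_acceptance_rate(records):
    """Return the share of wrong answers that the judge approved."""
    # Only wrong answers count toward FAR; correct answers are ignored.
    wrong = [r for r in records if not r.is_correct]
    if not wrong:
        raise ValueError("FAR is undefined without at least one wrong answer")
    approved_wrong = sum(1 for r in wrong if r.judge_approved)
    return approved_wrong / len(wrong)
```

For example, if the judge approves one of four wrong answers, this function returns 0.25, regardless of how many correct answers are in the set.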
Key takeaway
For CTOs and VPs of Engineering evaluating AI systems for regulated financial services, relying solely on LLM-as-judge for structured tasks such as compliance checks or numerical comparisons introduces unacceptable risk. Teams should adopt a tiered verification architecture: reserve LLM judges for subjective or analytical tasks, and use deterministic, graph-verified evaluation wherever correctness is objectively provable. This helps meet effective challenge standards and prevents false acceptances from propagating through agentic pipelines.
Key insights
LLM-as-judge is unreliable for structured tasks, even with ground truth or source documents.
Principles
- Semantic similarity is insufficient for structured verification.
- LLM judges share failure modes with LLM answer generators.
- Deterministic tasks require deterministic verification.
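To make the first and third principles concrete, here is a minimal sketch of a deterministic numeric check. It is not the study's verifier; the function names are hypothetical, and the parsing deliberately ignores signs, units, and currencies. The point is that "10%" and "10.5%" read as near-identical text but fail an exact comparison.

```python
import re
from decimal import Decimal

# Matches unsigned integers and decimals; signs and units are out of scope here.
_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def extract_numbers(text):
    """Return every number in the text, in order, as exact Decimals."""
    # Drop thousands separators so "1,000" parses as 1000.
    cleaned = re.sub(r"(?<=\d),(?=\d{3})", "", text)
    return [Decimal(m) for m in _NUMBER.findall(cleaned)]


def numbers_match(expected, answer):
    """True only if both texts contain the same numbers in the same order."""
    return extract_numbers(expected) == extract_numbers(answer)
```

A check like this either passes or fails; there is no notion of "close enough", which is the property a semantic-similarity judge lacks.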
Method
The study used FinStructBench to generate graph-verified ground truth for financial documents, then evaluated Claude Opus 4.6 as a judge across four configurations: strict with ground truth, lenient with ground truth, blind, and grounded with source document.
In practice
- Implement graph-verified evaluation for deterministic tasks.
- Use LLM judgment for analytical tasks with human review.
- Invest in clean, consistently defined data semantics.
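The first two recommendations amount to routing each task to a verifier by type. The sketch below is one possible shape for that routing, under the assumption that tasks carry a type label; the labels and names are hypothetical, apart from threshold and exact-recall questions, which the summary names as deterministic cases.

```python
from enum import Enum


class Verifier(Enum):
    DETERMINISTIC = "deterministic"
    LLM_JUDGE_WITH_HUMAN_REVIEW = "llm_judge_with_human_review"


# Hypothetical labels for tasks whose correctness is objectively provable.
DETERMINISTIC_TASKS = {"threshold", "exact_recall", "numeric_comparison"}


def choose_verifier(task_type):
    """Send provable tasks to deterministic checks, the rest to a reviewed LLM judge."""
    if task_type in DETERMINISTIC_TASKS:
        return Verifier.DETERMINISTIC
    return Verifier.LLM_JUDGE_WITH_HUMAN_REVIEW
```

Defaulting unknown task types to the reviewed LLM path is a design choice in this sketch; a stricter deployment might reject unlabeled tasks instead.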
Topics
- LLM-as-Judge
- Graph-Verified Ground Truth
- Financial Documents
- False Acceptance Rate
- FinStructBench
Best for: CTO, VP of Engineering/Data, Executive, AI Scientist, MLOps Engineer, Director of AI/ML
Editorial summary, takeaway, and curation by AIssential. Original article published by Agus’s Substack.