Judge the Judge: Building LLM Evaluators That Actually Work with GEPA — Mahmoud Mabrouk, Agenta AI

2026-04-10 · Source: AI Engineer · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Advanced, extended

Summary

This content addresses the challenge of building calibrated Large Language Model (LLM) judges for evaluating AI agents, particularly in customer support scenarios. Uncalibrated LLM judges are fast but give unreliable signals, which undermines both rapid development and online evaluation. The proposed solution is to optimize the judge so that its verdicts align with human annotations, using a prompt optimization algorithm such as GEPA (Genetic-Pareto). The process has three steps: design use-case-specific, binary metrics; curate and annotate data, recording detailed reasoning for each label; and apply GEPA to refine the judge's prompt. A practical walkthrough uses the τ-bench (tau-bench) airline customer support dataset and shows a naive seed judge being improved iteratively: accuracy rises from 69% to 74%, and bias falls as the judge learns specific policy criteria.

Key takeaway

For AI Engineers evaluating agent reliability, relying on uncalibrated LLM judges can lead to misleading signals and slow development. Calibrate your LLM judges against human annotations with a prompt optimization technique such as GEPA. Design specific, binary evaluation metrics and include detailed reasoning in your annotated data; both improve judge accuracy and reduce bias, which in turn speeds up development and deployment cycles.

Key insights

Calibrated LLM judges, aligned with human annotations via prompt optimization, accelerate AI agent development and reliable online evaluation.

Principles

Metrics must be use-case specific and binary.
Detailed human reasoning is crucial for LLM judge learning.
Seed prompt design significantly impacts optimization success.
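
To make the first two principles concrete, here is a minimal sketch of how a binary judge could be scored against human annotations. The record layout and the names Annotation and agreement_report are assumptions made for illustration; they do not come from the talk or from any library.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    """One annotated conversation (hypothetical layout)."""
    human_pass: bool      # human label for the binary metric
    judge_pass: bool      # LLM judge verdict on the same conversation
    reasoning: str = ""   # the annotator's written justification


def agreement_report(rows):
    """Return accuracy plus the two directions of disagreement."""
    n = len(rows)
    if n == 0:
        raise ValueError("need at least one annotated example")
    agree = sum(r.human_pass == r.judge_pass for r in rows)
    # Judge says pass, human says fail: the judge is too lenient.
    false_pass = sum(r.judge_pass and not r.human_pass for r in rows)
    # Judge says fail, human says pass: the judge is too strict.
    false_fail = sum(r.human_pass and not r.judge_pass for r in rows)
    return {
        "accuracy": agree / n,
        "false_pass_rate": false_pass / n,
        "false_fail_rate": false_fail / n,
    }
```

A judge that assumes compliance would show up here as a false_pass_rate well above its false_fail_rate, which is one simple way to read "bias" for a binary metric.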

Method

Optimize LLM judge prompts using GEPA, which samples new candidate prompts via mutation and merging, evaluates them against human annotations, and filters them using a Pareto frontier to improve performance and reduce bias.
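
The Pareto filtering step can be illustrated with a small sketch. It is a simplified stand-in, not GEPA's actual implementation: it assumes each candidate prompt has a vector of per-example scores (1 if the judge agreed with the human label, 0 otherwise) and keeps every candidate that no other candidate dominates. The function names and candidate labels are hypothetical.

```python
def dominates(a, b):
    """True if score vector a is at least as good as b on every example
    and strictly better on at least one."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def pareto_front(scores):
    """Return the names of candidates that no other candidate dominates.

    scores maps a candidate name to its per-example score list; all lists
    are assumed to cover the same examples in the same order.
    """
    return [
        name
        for name, vec in scores.items()
        if not any(
            dominates(other_vec, vec)
            for other, other_vec in scores.items()
            if other != name
        )
    ]


# Hypothetical per-example agreement for three candidate judge prompts.
candidates = {
    "seed": [1, 0, 0, 1],
    "mutant_a": [1, 1, 0, 1],
    "mutant_b": [0, 0, 1, 1],
}
survivors = pareto_front(candidates)
```

In this toy data the seed is dropped because mutant_a matches or beats it on every example, while mutant_b survives despite a lower total because it is the only candidate that gets the third example right. Keeping such specialists is the point of filtering on a frontier rather than on average accuracy alone.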

In practice

Use GEPA's "optimize anything" API for prompt optimization.
Start with a biased seed judge assuming compliance.
Iterate on reflection templates to guide LLM judge learning.

Topics

LLM as a Judge
Prompt Optimization
GEPA Algorithm
Agenta Platform
τ-bench (tau-bench) Dataset

Best for: AI Engineer, MLOps Engineer, Machine Learning Engineer


Editorial summary, takeaway, and curation by AIssential. Original article published by AI Engineer.