The maturity phases of running evals — Phil Hetzel, Braintrust

2026-05-27 · Source: AI Engineer · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Intermediate, long

Summary

Phil Hetzel, Solutions Engineering Lead at Braintrust, outlines four maturity phases for conducting evaluations (evals) of AI agents, emphasizing their critical role in ensuring agent quality, mitigating risks, and driving performance improvements. The initial "Just Getting Started" phase involves human annotators providing thumbs up/down feedback and justifications to capture domain-specific failure modes. This progresses to "Measuring to Manage," where LLMs act as judges and objective code-based checks scale human expertise, integrating production traces into evaluation datasets. The "Accounting for Complexity" stage addresses agents interacting with external systems via tool calls, necessitating full trace evaluation and techniques like mock APIs or timestamp queries to manage external system state. Finally, "Advanced Eval Techniques" include automated failure mode discovery through topic modeling and streamlined eval execution using Claude Code and CLIs. The presentation stresses that evals are directional, not exhaustive, and that even LLM judges require their own validation.

Key takeaway

MLOps engineers who deploy and maintain AI agents should mature their evaluation practices systematically. Begin by documenting human expert feedback to identify core failure modes, then scale that knowledge with LLM-as-judge techniques, always validating the judge's output. Integrate production traces into your evaluation datasets so that offline results reflect real-world usage. As agents begin to interact with external systems, plan for full trace evaluation and mock the external state so that complex behaviors can be assessed accurately before deployment.
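
Full trace evaluation against a mocked external system can be sketched as follows. This is a minimal illustration, not the approach shown in the talk: MockInventoryAPI, toy_agent, and score_trace are hypothetical names, and the agent is a placeholder that makes a single tool call. The point is that the mock freezes external state, so the scorer can check the tool calls in the trace as well as the final answer.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolCall:
    name: str
    args: dict[str, Any]
    result: Any


@dataclass
class Trace:
    user_input: str
    tool_calls: list[ToolCall] = field(default_factory=list)
    final_output: str = ""


class MockInventoryAPI:
    """Hypothetical stand-in for an external system, frozen to a known state."""

    def __init__(self, stock: dict[str, int]):
        self._stock = dict(stock)

    def get_stock(self, sku: str) -> int:
        return self._stock.get(sku, 0)


def toy_agent(user_input: str, api: MockInventoryAPI) -> Trace:
    """Placeholder agent: treats the last word of the input as a SKU."""
    trace = Trace(user_input=user_input)
    sku = user_input.split()[-1].strip("?")
    count = api.get_stock(sku)
    # Record the tool call so the eval can inspect the whole trace.
    trace.tool_calls.append(ToolCall("get_stock", {"sku": sku}, count))
    trace.final_output = f"{sku} has {count} units in stock."
    return trace


def score_trace(trace: Trace, expected_tool: str, expected_count: int) -> dict[str, bool]:
    """Score intermediate steps and the final answer, not the answer alone."""
    called = [c for c in trace.tool_calls if c.name == expected_tool]
    return {
        "called_expected_tool": len(called) == 1,
        "answer_matches_tool_result": str(expected_count) in trace.final_output,
    }


api = MockInventoryAPI({"A-100": 7})
trace = toy_agent("How many units of A-100?", api)
scores = score_trace(trace, "get_stock", 7)
print(scores)
```

Because the mock returns the same stock level on every run, a change in the scores points at the agent rather than at drift in the external system.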

Key insights

Evals mature from human judgment to automated, production-integrated systems, and they are crucial for agent quality and risk management.

Principles

Evals ensure agent quality and mitigate risks.
Focus evals on specific agent failure modes.
LLM judges must be evaluated themselves.
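
One way to evaluate a judge is to compare its verdicts with human labels on the same traces. The sketch below is an assumption about how that comparison might be done, not a method described in the talk: it computes the raw agreement rate and Cohen's kappa, which discounts agreement expected by chance. The label lists are made-up illustration data.

```python
from collections import Counter


def agreement_rate(human: list[str], judge: list[str]) -> float:
    """Fraction of items where the judge gives the same label as the human."""
    if not human or len(human) != len(judge):
        raise ValueError("label lists must be non-empty and the same length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    n = len(human)
    p_o = agreement_rate(human, judge)
    h_counts, j_counts = Counter(human), Counter(judge)
    labels = set(h_counts) | set(j_counts)
    # Chance agreement if both raters labeled independently at their own rates.
    p_e = sum((h_counts[l] / n) * (j_counts[l] / n) for l in labels)
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Illustration data only: human verdicts vs. LLM-judge verdicts on 8 traces.
human = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "pass"]
judge = ["pass", "pass", "fail", "pass", "pass", "fail", "fail", "pass"]

print(f"agreement: {agreement_rate(human, judge):.2f}")
print(f"kappa:     {cohens_kappa(human, judge):.2f}")
```

A judge with high raw agreement but low kappa may simply be echoing the majority label, which is worth checking before trusting it at scale.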

Method

Implement an eval flywheel: capture production traces, identify failures, rerun in offline evals, and use results to guide agent improvement.
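
The flywheel steps can be sketched in a few functions. This is a simplified illustration under stated assumptions: traces are plain dicts with hypothetical "input", "output", and "expected" keys (in practice the expected value would come from human review), and the scorer is an exact-match check standing in for whatever scorer a team actually uses.

```python
from dataclasses import dataclass
from typing import Callable

Scorer = Callable[[str, str], bool]


@dataclass(frozen=True)
class Example:
    input: str
    expected: str


def find_failures(production_traces: list[dict], scorer: Scorer) -> list[dict]:
    """Identify failures: keep the traces the scorer marks as failing."""
    return [t for t in production_traces if not scorer(t["output"], t["expected"])]


def add_to_dataset(dataset: list[Example], failures: list[dict]) -> list[Example]:
    """Fold failing production cases into the offline dataset, skipping duplicates."""
    seen = {ex.input for ex in dataset}
    for t in failures:
        if t["input"] not in seen:
            dataset.append(Example(t["input"], t["expected"]))
            seen.add(t["input"])
    return dataset


def run_offline_eval(agent: Callable[[str], str], dataset: list[Example], scorer: Scorer) -> float:
    """Rerun the agent offline and return the pass rate on the dataset."""
    if not dataset:
        return 0.0
    passed = sum(scorer(agent(ex.input), ex.expected) for ex in dataset)
    return passed / len(dataset)


def exact_match(output: str, expected: str) -> bool:
    return output == expected


# Captured production traces (illustration data only).
traces = [
    {"input": "2+2", "output": "4", "expected": "4"},
    {"input": "capital of France", "output": "paris", "expected": "Paris"},
]

dataset = add_to_dataset([], find_failures(traces, exact_match))

# Two placeholder agent versions: v1 reproduces the bug, v2 fixes it.
agent_v1 = {"2+2": "4", "capital of France": "paris"}.get
agent_v2 = {"2+2": "4", "capital of France": "Paris"}.get

score_v1 = run_offline_eval(agent_v1, dataset, exact_match)
score_v2 = run_offline_eval(agent_v2, dataset, exact_match)
print(score_v1, score_v2)
```

The dataset grows only from observed failures, so each turn of the loop adds regression cases that the next agent version has to pass.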

In practice

Document human "vibe checks" with justifications.
Scale evaluations using LLMs as judges.
Capture production traces for eval datasets.
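
Documented vibe checks are easier to act on when each one is stored as a structured record. The sketch below assumes a hypothetical Annotation record with a thumbs up/down verdict, a written justification, and an optional failure-mode tag assigned by the reviewer; none of these field names come from the talk. Counting the tags shows which failure modes deserve an automated eval first.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Annotation:
    trace_id: str
    thumbs_up: bool
    justification: str                   # why the expert gave this verdict
    failure_mode: Optional[str] = None   # short tag, set only on thumbs down


def failure_mode_counts(annotations: list[Annotation]) -> Counter:
    """Tally failure-mode tags across all thumbs-down annotations."""
    return Counter(
        a.failure_mode for a in annotations if not a.thumbs_up and a.failure_mode
    )


# Illustration data only.
annotations = [
    Annotation("t1", True, "Correct refund amount and polite tone."),
    Annotation("t2", False, "Looked up the wrong order before answering.", "wrong_tool"),
    Annotation("t3", False, "Cited a return policy that does not exist.", "hallucinated_policy"),
    Annotation("t4", False, "Called search instead of the orders API.", "wrong_tool"),
]

counts = failure_mode_counts(annotations)
for mode, n in counts.most_common():
    print(f"{mode}: {n}")
```

The justification text is kept alongside the tag because it is the raw material for writing a judge prompt or a code-based check later.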

Topics

AI Agent Evaluation
LLM as Judge
Agent Quality
Evaluation Maturity
Production Tracing
Tool Calling Agents

Best for: MLOps Engineer, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by AI Engineer.