The maturity phases of running evals — Phil Hetzel, Braintrust
Summary
Phil Hetzel, Solutions Engineering Lead at Braintrust, outlines four maturity phases for conducting evaluations (evals) of AI agents, emphasizing their critical role in ensuring agent quality, mitigating risks, and driving performance improvements. The initial "Just Getting Started" phase involves human annotators providing thumbs up/down feedback and justifications to capture domain-specific failure modes. This progresses to "Measuring to Manage," where LLMs act as judges and objective code-based checks scale human expertise, integrating production traces into evaluation datasets. The "Accounting for Complexity" stage addresses agents interacting with external systems via tool calls, necessitating full trace evaluation and techniques like mock APIs or timestamp queries to manage external system state. Finally, "Advanced Eval Techniques" include automated failure mode discovery through topic modeling and streamlined eval execution using cloud code and CLIs. The presentation stresses that evals are directional, not exhaustive, and even LLM judges require their own validation.
Key takeaway
If you deploy and maintain AI agents as an MLOps engineer, mature your evaluation practices systematically. Begin by documenting human expert feedback to identify core failure modes, then scale that knowledge with LLM-as-judge techniques, always validating the judge's output. Integrate production traces into your evaluation datasets so that eval results reflect real-world usage. As agents begin to interact with external systems, plan for full trace evaluation and mock the external state so that complex behaviors can be assessed reproducibly before deployment.
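As a rough illustration of full trace evaluation against mocked external state, the sketch below swaps a real API for a mock with fixed data, records every step the agent takes, and then checks the whole trace rather than only the final answer. Everything here is hypothetical and not taken from the talk or from any Braintrust API: the `get_weather` tool, the trace step format, and the toy agent are all assumptions made for illustration.

```python
from typing import Callable


def make_mock_weather_api(fixed_state: dict[str, str]) -> Callable[[str], str]:
    """Return a stand-in for an external API that always answers from fixed_state.

    Hypothetical helper: pinning the external state makes eval runs repeatable.
    """
    def get_weather(city: str) -> str:
        return fixed_state[city]
    return get_weather


def run_agent_with_trace(city: str, tool: Callable[[str], str]) -> list[dict]:
    """Toy agent (assumption): calls one tool, then answers. Records each step."""
    trace = []
    result = tool(city)
    trace.append({
        "type": "tool_call",
        "name": "get_weather",
        "args": {"city": city},
        "result": result,
    })
    trace.append({"type": "final_answer", "content": f"It is {result} in {city}."})
    return trace


def check_trace(trace: list[dict], expected_tool: str, expected_result: str) -> dict[str, bool]:
    """Code-based checks over the full trace, not just the last message."""
    tool_calls = [step for step in trace if step["type"] == "tool_call"]
    called = any(step["name"] == expected_tool for step in tool_calls)
    final = trace[-1]
    grounded = final["type"] == "final_answer" and expected_result in final["content"]
    return {
        "called_expected_tool": called,
        "answer_grounded_in_tool_result": grounded,
    }


mock_tool = make_mock_weather_api({"Paris": "sunny"})
trace = run_agent_with_trace("Paris", mock_tool)
scores = check_trace(trace, expected_tool="get_weather", expected_result="sunny")
```

In a real setup the agent would be an LLM-driven loop and the mock would mirror the actual API's schema; the point of the sketch is only that the checks read intermediate steps as well as the final output.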
Key insights
Evals mature from human judgment to automated, production-integrated systems, and at every stage they are central to agent quality and risk management.
Principles
- Evals ensure agent quality and mitigate risks.
- Focus evals on specific agent failure modes.
- LLM judges must be evaluated themselves.
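One simple way to evaluate a judge, sketched here under assumptions, is to compare its verdicts with human thumbs up/down labels on the same outputs and then inspect the disagreements. The `LabeledExample` record and both helper functions are hypothetical names introduced for illustration; the judge verdicts are assumed to have been collected already.

```python
from dataclasses import dataclass


@dataclass
class LabeledExample:
    output: str        # the agent output being graded
    human_pass: bool   # human annotator's thumbs up (True) or down (False)
    judge_pass: bool   # LLM judge's verdict on the same output


def judge_agreement(examples: list[LabeledExample]) -> float:
    """Fraction of examples where the judge matches the human label."""
    if not examples:
        raise ValueError("need at least one labeled example")
    agree = sum(e.human_pass == e.judge_pass for e in examples)
    return agree / len(examples)


def disagreements(examples: list[LabeledExample]) -> list[LabeledExample]:
    """Examples worth reviewing by hand: the judge and the human differ."""
    return [e for e in examples if e.human_pass != e.judge_pass]


examples = [
    LabeledExample("refund issued correctly", human_pass=True, judge_pass=True),
    LabeledExample("cited the wrong policy", human_pass=False, judge_pass=False),
    LabeledExample("polite but unhelpful", human_pass=False, judge_pass=True),
    LabeledExample("answered the question", human_pass=True, judge_pass=True),
]
rate = judge_agreement(examples)     # 3 of 4 verdicts match
to_review = disagreements(examples)  # the "polite but unhelpful" case
```

Raw agreement is a coarse measure; when passes and failures are heavily imbalanced, it can look high even for a judge that rarely catches failures, so reading the disagreements themselves is usually the more informative step.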
Method
Implement an eval flywheel: capture production traces, identify failures, rerun in offline evals, and use results to guide agent improvement.
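The flywheel above can be sketched as a small loop. This is a minimal illustration under assumptions, not the workflow of any particular tool: the trace format (a dict with `input`, `output`, and `feedback` keys), the `agent` callable, and the `scorer` callable are all hypothetical.

```python
from typing import Callable

Trace = dict[str, str]  # assumed shape: {"input": ..., "output": ..., "feedback": "up" | "down"}


def select_failures(traces: list[Trace]) -> list[Trace]:
    """Step 2: keep the production traces that users flagged as failures."""
    return [t for t in traces if t.get("feedback") == "down"]


def run_offline_eval(
    agent: Callable[[str], str],
    scorer: Callable[[str, str], bool],
    cases: list[Trace],
) -> list[dict]:
    """Step 3: replay each failing input through the current agent and score it."""
    results = []
    for case in cases:
        output = agent(case["input"])
        results.append({
            "input": case["input"],
            "output": output,
            "passed": scorer(case["input"], output),
        })
    return results


def pass_rate(results: list[dict]) -> float:
    """Step 4: one directional number to compare agent versions against."""
    if not results:
        return 0.0
    return sum(r["passed"] for r in results) / len(results)


# Step 1: traces captured from production (hard-coded here for illustration).
production_traces = [
    {"input": "reset my password", "output": "I cannot help.", "feedback": "down"},
    {"input": "what are your hours", "output": "9am to 5pm.", "feedback": "up"},
]

failures = select_failures(production_traces)


def candidate_agent(text: str) -> str:
    # Stand-in for the improved agent under test.
    return f"Here is how to {text}."


def mentions_request(user_input: str, output: str) -> bool:
    # Stand-in scorer: a code-based check that the reply addresses the request.
    return user_input in output


results = run_offline_eval(candidate_agent, mentions_request, failures)
rate = pass_rate(results)
```

Rerunning the same failure set after each agent change shows whether a fix actually helped, which is what makes the loop a flywheel and not a one-off test.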
In practice
- Document human "vibe checks" with justifications.
- Scale evaluations using LLMs as judges.
- Capture production traces for eval datasets.
Topics
- AI Agent Evaluation
- LLM as Judge
- Agent Quality
- Evaluation Maturity
- Production Tracing
- Tool Calling Agents
Best for: MLOps Engineer, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by AI Engineer.