Ship Real Agents: Hands-On Evals for Agentic Applications — Laurie Voss, Arize

2026-05-14 · Source: AI Engineer · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Advanced, extended

Summary

Laurie Voss, Head of Developer Experience at Arize AI, presented a workshop on evaluating AI systems, particularly agents, emphasizing the transition from "vibe checks" to formal evaluations. The session covered fundamental concepts such as traces and spans, and explained why traditional unit tests fail for LLMs: their outputs are non-deterministic, so exact-match assertions break. Voss detailed three types of evaluation: deterministic code evals for simple checks (e.g., JSON format, token limits), LLM-as-a-judge evals for semantic properties (e.g., factual accuracy, tone), and human evaluation as the gold standard for building golden datasets. The workshop demonstrated setting up tracing with Arize Phoenix, analyzing trace data to identify failure patterns, and writing custom LLM evals with detailed rubrics and examples. It also introduced meta-evaluation to assess judge reliability, the impact hierarchy for prioritizing improvements, and the data flywheel for continuous agent improvement.
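
The deterministic code evals mentioned above (JSON format, token limits) can be sketched in a few lines of Python. This is an illustrative sketch rather than code from the workshop: the function names are hypothetical, and the token check counts whitespace-separated words as a rough stand-in for a real tokenizer.

```python
import json


def eval_valid_json(output: str) -> bool:
    """Pass if the model output parses as JSON."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False


def eval_within_token_limit(output: str, max_tokens: int = 50) -> bool:
    """Pass if the output stays under a length budget.

    Assumption: whitespace-separated words approximate tokens; a real
    check would use the target model's tokenizer.
    """
    return len(output.split()) <= max_tokens


def run_code_evals(output: str) -> dict[str, bool]:
    """Run every deterministic check and return one pass/fail flag per eval."""
    return {
        "valid_json": eval_valid_json(output),
        "within_token_limit": eval_within_token_limit(output),
    }
```

Checks like these are cheap and repeatable, so they can run on every trace before any LLM-as-a-judge eval is spent on the same output.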

Key takeaway

For AI engineers developing and deploying LLM agents, relying solely on "vibe checks" is insufficient and leads to unpredictable failures. Implement a structured evaluation suite using tools like Arize Phoenix to capture traces, analyze failure patterns, and systematically apply code, LLM-as-a-judge, and human evaluations. This approach enables data-driven prompt engineering, makes agent reliability measurable, and supports fast iteration and model upgrades while catching regressions.

Key insights

Formal evaluations are crucial for AI agents, moving beyond subjective "vibe checks" to ensure reliability and performance.

Principles

Evals are tests powered by log data (traces).
LLM outputs are non-deterministic, requiring semantic evaluation.
Combine code, LLM, and human evals for comprehensive testing.
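
To make the semantic-evaluation principle concrete, here is a minimal sketch of an LLM-as-a-judge eval. It is a sketch under stated assumptions, not the workshop's code: the rubric wording, the label set, and the `call_llm` parameter are hypothetical. `call_llm` stands for any function that sends a prompt to a model and returns its text reply, so no particular provider API is implied.

```python
from typing import Callable

# Hypothetical rubric; a production judge prompt would carry more detail
# and worked examples.
JUDGE_TEMPLATE = """You are grading an AI agent's answer.

Rubric:
- "correct": the answer is factually consistent with the reference.
- "incorrect": the answer contradicts or omits key facts from the reference.

Question: {question}
Reference: {reference}
Answer: {answer}

Reply with one word: correct or incorrect."""

ALLOWED_LABELS = ("correct", "incorrect")


def judge_answer(
    question: str,
    reference: str,
    answer: str,
    call_llm: Callable[[str], str],
) -> str:
    """Ask the judge model for a label and normalize its reply."""
    prompt = JUDGE_TEMPLATE.format(
        question=question, reference=reference, answer=answer
    )
    raw = call_llm(prompt).strip().strip(".").lower()
    # Anything outside the label set is flagged instead of guessed at.
    return raw if raw in ALLOWED_LABELS else "unparseable"
```

Constraining the judge to a fixed label set and flagging unparseable replies keeps the eval's output countable, which matters once judge labels are compared against human labels.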

Method

Capture agent execution as traces, analyze them for failure patterns, define explicit success criteria, then write and iterate on code and LLM-as-a-judge evals, using meta-evaluation to validate judge reliability.
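
The meta-evaluation step in the method above amounts to scoring the judge against human labels on a golden dataset. The sketch below assumes both label lists are aligned by example and treats "incorrect" as the positive class (the failures the judge is supposed to catch). The function name and the choice of metrics are illustrative assumptions.

```python
def meta_evaluate(
    judge_labels: list[str],
    human_labels: list[str],
    positive: str = "incorrect",
) -> dict[str, float]:
    """Compare judge labels with human golden labels.

    Returns raw agreement plus precision and recall for the positive class.
    """
    if not judge_labels or len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be non-empty and the same length")

    pairs = list(zip(judge_labels, human_labels))
    agreement = sum(j == h for j, h in pairs) / len(pairs)

    tp = sum(j == positive and h == positive for j, h in pairs)
    fp = sum(j == positive and h != positive for j, h in pairs)
    fn = sum(j != positive and h == positive for j, h in pairs)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"agreement": agreement, "precision": precision, "recall": recall}
```

Low precision means the judge flags good answers as failures; low recall means it misses real failures. Either result points to revising the judge's rubric or examples before its scores are trusted.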

In practice

Use Arize Phoenix for AI observability and trace capture.
Prioritize data quality fixes over prompt tuning.
Build golden datasets with human judgment for judge validation.
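
One way to decide which fix to prioritize is to tally failure categories across annotated traces. The sketch below assumes each trace has been exported as a plain dictionary with a hand-assigned `failure_category` field. That field name and the category values are hypothetical and do not reflect Phoenix's actual schema.

```python
from collections import Counter


def rank_failure_patterns(annotated_traces: list[dict]) -> list[tuple[str, int]]:
    """Count failure categories across traces, most frequent first.

    Traces with no category (None or missing) are treated as passing.
    """
    counts = Counter(
        trace["failure_category"]
        for trace in annotated_traces
        if trace.get("failure_category")
    )
    return counts.most_common()


# Hypothetical annotations, for illustration only.
example_traces = [
    {"trace_id": "t1", "failure_category": "bad_retrieval"},
    {"trace_id": "t2", "failure_category": None},
    {"trace_id": "t3", "failure_category": "bad_retrieval"},
    {"trace_id": "t4", "failure_category": "wrong_tool"},
]
```

If data-related categories such as retrieval failures dominate the ranking, that supports fixing data quality before spending time on prompt tuning.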

Topics

AI Agent Evaluation
Arize Phoenix
LLM as a Judge
Trace Data
Code Evals

Best for: AI Engineer, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by AI Engineer.