Ship Real Agents: Hands-On Evals for Agentic Applications — Laurie Voss, Arize
Summary
Laurie Voss, Head of Developer Experience at Arize AI, presented a workshop on evaluating AI systems, particularly agents, emphasizing the transition from "vibe checks" to formal evaluations. The session covered fundamental concepts like traces and spans and explained why traditional unit tests fall short for LLMs, whose outputs are non-deterministic. Voss detailed three types of evaluations: deterministic code evals for simple checks (e.g., JSON format, token limits), LLM-as-a-judge evals for semantic understanding (e.g., factual accuracy, tone), and human evaluation as the gold standard for building golden datasets. The workshop demonstrated setting up tracing with Arize Phoenix, analyzing trace data to identify failure patterns, and writing custom LLM evals with detailed rubrics and examples. It also introduced meta-evaluation to assess judge reliability, the impact hierarchy for prioritizing improvements, and the data flywheel for continuous agent enhancement.
Key takeaway
For AI engineers developing and deploying LLM agents, relying solely on "vibe checks" is insufficient and leads to unpredictable failures. Implement a structured evaluation suite instead: use a tool like Arize Phoenix to capture traces, analyze them for failure patterns, and systematically apply code, LLM-as-a-judge, and human evaluations. This approach makes prompt engineering data-driven, catches regressions when you iterate or upgrade models, and improves overall agent reliability and performance.
Key insights
Formal evaluations are crucial for AI agents, moving beyond subjective "vibe checks" to ensure reliability and performance.
Principles
- Evals are tests powered by log data (traces).
- LLM outputs are non-deterministic, requiring semantic evaluation.
- Combine code, LLM, and human evals for comprehensive testing.
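As an illustration of the deterministic code evals mentioned in the summary (JSON format, token limits), here is a minimal Python sketch. The function names are hypothetical and not taken from the workshop, and the token count is approximated by splitting on whitespace, which is an assumption; a real check would use the model's own tokenizer.

```python
import json


def is_valid_json(output: str) -> bool:
    """Return True if the model output parses as JSON."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False


def within_token_limit(output: str, max_tokens: int) -> bool:
    """Approximate a token limit check by counting whitespace-separated words.

    Assumption: word count stands in for the true token count.
    """
    return len(output.split()) <= max_tokens


def run_code_evals(output: str, max_tokens: int = 50) -> dict:
    """Run every deterministic check and report pass/fail per check."""
    return {
        "valid_json": is_valid_json(output),
        "within_token_limit": within_token_limit(output, max_tokens),
    }
```

Checks like these are cheap and repeatable, so they can run on every output, leaving LLM-as-a-judge and human review for questions that need semantic judgment.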
Method
Capture agent execution as traces, analyze them for failure patterns, define explicit success criteria, then write and iterate on code and LLM-as-a-judge evals, using meta-evaluation to validate judge reliability.
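The "analyze them for failure patterns" step can be sketched with plain Python data structures. The `Span` and `Trace` classes below are simplified, hypothetical stand-ins for real trace data, not the Arize Phoenix schema; the point is only to show how failed spans can be grouped and counted so the most frequent failure surfaces first.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Span:
    """One step inside an agent run (hypothetical, simplified shape)."""
    name: str             # e.g. "llm_call" or "tool:search"
    status: str           # "ok" or "error"
    error_type: str = ""  # e.g. "timeout"; empty when status is "ok"


@dataclass
class Trace:
    """One end-to-end agent run, made up of spans."""
    trace_id: str
    spans: list[Span] = field(default_factory=list)


def failure_patterns(traces: list[Trace]) -> list[tuple[tuple[str, str], int]]:
    """Count failed spans by (span name, error type), most frequent first."""
    counts: Counter = Counter()
    for trace in traces:
        for span in trace.spans:
            if span.status == "error":
                counts[(span.name, span.error_type)] += 1
    return counts.most_common()
```

Ranking failures this way gives a starting point for defining explicit success criteria: the most common pattern is usually the first one worth writing an eval for.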
In practice
- Use Arize Phoenix for AI observability and trace capture.
- Prioritize data quality fixes over prompt tuning.
- Build golden datasets with human judgment for judge validation.
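One simple way to use a golden dataset for judge validation is to compare the judge's labels with the human labels on the same examples. The sketch below is an assumption about how such a meta-evaluation could look, not the workshop's code; `judge_agreement` and the label values are hypothetical.

```python
from collections import Counter


def judge_agreement(human_labels: list[str], judge_labels: list[str]):
    """Compare LLM-judge labels with human golden labels.

    Returns the fraction of examples where both agree, plus a count of
    each (human label, judge label) disagreement pair.
    """
    if len(human_labels) != len(judge_labels):
        raise ValueError("label lists must have the same length")
    if not human_labels:
        raise ValueError("need at least one labeled example")

    pairs = list(zip(human_labels, judge_labels))
    matches = sum(1 for human, judge in pairs if human == judge)
    disagreements = Counter(
        (human, judge) for human, judge in pairs if human != judge
    )
    return matches / len(pairs), disagreements


# Hypothetical golden dataset of four examples.
human = ["correct", "incorrect", "correct", "correct"]
judge = ["correct", "correct", "correct", "incorrect"]
rate, disagreements = judge_agreement(human, judge)
```

A low agreement rate, or disagreements concentrated in one pair of labels, suggests the judge prompt or rubric needs revision before its scores are trusted.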
Topics
- AI Agent Evaluation
- Arize Phoenix
- LLM as a Judge
- Trace Data
- Code Evals
Best for: AI Engineer, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by AI Engineer.