It's 2026, and We're Still Talking Evals
Summary
Maggie Konstanty, an AI Product Manager at Prosus, discusses the complexities of evaluating AI agents, particularly for large-scale food ordering and e-commerce. She highlights that simple accuracy metrics like "95% accurate" are insufficient for real-world performance, advocating for a shift from pre-ship to production-focused evaluations. Konstanty emphasizes the importance of user drop-off analytics as a critical, yet underutilized, signal for agent failure. She critiques the "20-evaluator trap," stressing the need to design evaluations directly tied to product goals and real user behavior, citing a "surprise me" edge case from Prosus's food ordering agent. Furthermore, she expresses skepticism about LLM-as-a-judge for accuracy in production, preferring alternative approaches, and offers a candid assessment of current eval platforms like Arize/Phoenix, noting that mature teams often revert to custom code. Konstanty concludes by advocating for evaluations as an embedded team practice and suggests using incentive-based red teaming.
Key takeaway
If you are an AI product manager or ML engineer struggling to make evaluations meaningful, recognize that pre-ship metrics often fail to predict how an agent behaves in production. Focus your evaluation strategy on real user behavior: use drop-off analytics and design tests linked directly to product outcomes. Consider incentive-based red teaming to find agent weaknesses proactively, and foster a culture where evaluation is a continuous, integrated team practice rather than a one-time checklist item.
Key insights
Effective AI agent evaluation requires moving beyond simple accuracy to real-world user behavior and production-aligned metrics.
Principles
- User drop-off is a critical failure signal.
- Evals must align with product goals.
- Evaluation is a continuous team practice.
Method
Design evaluations tied to real outcomes, prioritize user drop-off analytics, and consider incentive-based red teaming to identify agent vulnerabilities.
In practice
- Monitor user drop-off rates closely.
- Tie eval metrics to business KPIs.
- Implement red teaming with incentives.
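As a rough illustration of the first point, drop-off can be computed per conversation step from session logs. The sketch below is a minimal example under stated assumptions, not the approach described in the talk: the funnel step names, the session format, and the function name drop_off_by_step are all hypothetical.

```python
from collections import Counter

# Hypothetical funnel for a food-ordering agent; the step names are illustrative only.
FUNNEL = ["greeting", "intent_captured", "options_shown", "cart_built", "order_placed"]


def drop_off_by_step(sessions, funnel=FUNNEL):
    """For each funnel step, return the share of sessions that reached it but went no further.

    `sessions` maps a session id to the list of step names that session reached
    (an assumed log format, not a real platform's schema).
    """
    index = {step: i for i, step in enumerate(funnel)}

    # Count how many sessions ended at each step (their furthest point in the funnel).
    furthest = Counter()
    for steps in sessions.values():
        reached = [index[s] for s in steps if s in index]
        if reached:
            furthest[max(reached)] += 1

    rates = {}
    remaining = sum(furthest.values())  # sessions that reached the current step or beyond
    for i, step in enumerate(funnel[:-1]):  # the final step is completion, not a drop-off
        rates[step] = furthest[i] / remaining if remaining else 0.0
        remaining -= furthest[i]
    return rates


if __name__ == "__main__":
    example_sessions = {
        "s1": ["greeting", "intent_captured", "options_shown", "cart_built", "order_placed"],
        "s2": ["greeting", "intent_captured"],
        "s3": ["greeting", "intent_captured", "options_shown"],
        "s4": ["greeting"],
    }
    for step, rate in drop_off_by_step(example_sessions).items():
        print(f"{step}: {rate:.0%} of sessions that reached this step stopped here")
```

A step whose drop-off rate is much higher than its neighbors is a candidate for closer review of the underlying conversations, which is one way to tie an eval signal to a business KPI such as order completion.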
Topics
- LLM Evaluation
- AI Agent Performance
- Production Evaluation
- User Drop-off Analytics
- Evaluation Tooling
Best for: AI Product Manager, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by MLOps.community.