It's 2026, and We're Still Talking Evals

Source: MLOps.community · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Data Science & Analytics, Software Development & Engineering) · Depth: Advanced, quick

Summary

Maggie Konstanty, an AI Product Manager at Prosus, discusses the complexities of evaluating AI agents, particularly for large-scale food ordering and e-commerce. She argues that headline accuracy figures such as "95% accurate" say little about real-world performance, and she advocates shifting the focus from pre-ship evaluations to production evaluations. Konstanty singles out user drop-off analytics as a critical yet underused signal of agent failure. She critiques the "20-evaluator trap" and stresses that evaluations should be designed around product goals and real user behavior, citing a "surprise me" edge case from Prosus's food ordering agent. She is also skeptical of using LLM-as-a-judge to measure accuracy in production and prefers alternative approaches. Her assessment of current eval platforms such as Arize/Phoenix is candid: mature teams often revert to custom code. Konstanty concludes by advocating for evaluation as an embedded team practice and suggests incentive-based red teaming.

Key takeaway

AI Product Managers and ML engineers struggling to make evaluations meaningful should recognize that pre-ship metrics often fail in production. Focus your evaluation strategy on real user behavior: use drop-off analytics and design tests that link directly to product outcomes. Consider incentive-based red teaming to identify agent weaknesses proactively, and foster a culture where evaluation is a continuous, integrated team practice rather than a one-time checklist item.

Key insights

Effective AI agent evaluation requires moving beyond simple accuracy to real-world user behavior and production-aligned metrics.

Method

Design evaluations tied to real outcomes, prioritize user drop-off analytics, and consider incentive-based red teaming to identify agent vulnerabilities.
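As an illustration of the drop-off signal, here is a minimal sketch in Python. It is not taken from the talk: the session format (a list of step names plus a completion flag), the step names, and the function `drop_off_by_step` are all hypothetical. It computes, for each step, the share of sessions reaching that step that ended there without completing.

```python
from collections import Counter

def drop_off_by_step(sessions):
    """Return {step: abandoned_at_step / sessions_that_reached_step}.

    `sessions` is a hypothetical log format: (list_of_step_names, completed).
    """
    reached = Counter()    # sessions that visited each step at least once
    abandoned = Counter()  # uncompleted sessions whose last step was this one
    for steps, completed in sessions:
        for step in set(steps):
            reached[step] += 1
        if not completed and steps:
            abandoned[steps[-1]] += 1
    return {step: abandoned[step] / reached[step] for step in reached}

# Illustrative data only; the step names are assumptions.
sessions = [
    (["greet", "recommend", "checkout"], True),
    (["greet", "recommend"], False),
    (["greet", "recommend"], False),
    (["greet"], False),
]
rates = drop_off_by_step(sessions)
```

A step with an unusually high rate is a candidate for closer review of the agent's responses at that point. The metric needs only behavioral logs, not labeled accuracy data.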

Best for: AI Product Manager, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by MLOps.community.