Evaluating Multi-Turn Agents: A Quality Study of Microsoft Foundry’s Multi-Turn Evaluators
Summary
A quality study of Microsoft Foundry's multi-turn evaluators for AI agents measures how well they assess complex, multi-turn interactions. The study tested four evaluators: Task Completion, Customer Satisfaction (CSAT), Groundedness, and Conversation Coherence. It used benchmark datasets such as τ²-bench, FaithDial, and Copilot CLI sessions, and it measured validity (PR-AUC), reliability (flip rate), and robustness across six judge models, including gpt-5.5, Claude Opus 4.7, and DeepSeek-V3.2. Task Completion, CSAT, and Conversation Coherence reach substantial-to-near-perfect agreement with ground truth and show low variance, with CSAT at 0.82–0.97 PR-AUC. Groundedness is harder: it reaches 0.60–0.82 PR-AUC, and its performance depends strongly on the choice of judge.
Key takeaway
For AI engineers developing multi-turn agents, the four evaluators do not warrant equal trust. Prioritize Microsoft Foundry's Task Completion, CSAT, and Conversation Coherence evaluators for robust session-level metrics. For Groundedness, use a frontier reasoning judge such as gpt-5.5 or Grok-4, and treat the result as a triage signal rather than a hard gate, because Groundedness scores are sensitive to the judge tier. Always test across domains, and pair validity with reliability to avoid phantom regressions (apparent score drops caused by evaluator noise rather than by the agent).
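The takeaway above can be sketched as a small release-gating policy. This is a minimal illustration, not Foundry's API: the score names, the 0-to-1 score scale, the threshold values, and the `decide` helper are all assumptions made for the example. The three stable evaluators block a release, while Groundedness only raises a review flag.

```python
from dataclasses import dataclass, field

# Hypothetical thresholds for illustration only; the study does not prescribe them.
HARD_GATES = {
    "task_completion": 0.80,
    "csat": 0.80,
    "conversation_coherence": 0.80,
}
GROUNDEDNESS_TRIAGE_THRESHOLD = 0.70


@dataclass
class GateDecision:
    passed: bool                      # False if any hard gate failed
    failed_gates: list[str] = field(default_factory=list)
    needs_review: bool = False        # True if Groundedness looks low


def decide(scores: dict[str, float]) -> GateDecision:
    """Block on the three stable evaluators; only flag on Groundedness.

    `scores` maps an evaluator name to an aggregate session-level score in
    [0, 1]. A missing score is treated as 0.0, so it fails its gate.
    """
    failed = [
        name
        for name, minimum in HARD_GATES.items()
        if scores.get(name, 0.0) < minimum
    ]
    needs_review = scores.get("groundedness", 0.0) < GROUNDEDNESS_TRIAGE_THRESHOLD
    return GateDecision(passed=not failed, failed_gates=failed, needs_review=needs_review)


# A build that clears every hard gate but has a low Groundedness score
# still passes; it is only routed to a human for review.
flagged = decide({
    "task_completion": 0.90,
    "csat": 0.85,
    "conversation_coherence": 0.95,
    "groundedness": 0.50,
})

# A build that misses the CSAT gate is blocked regardless of Groundedness.
blocked = decide({
    "task_completion": 0.90,
    "csat": 0.60,
    "conversation_coherence": 0.95,
    "groundedness": 0.90,
})
```

Keeping the triage flag separate from `passed` means a noisy Groundedness judge can prompt a look at the session without failing the build on its own.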
Key insights
Microsoft Foundry's multi-turn evaluators offer robust session-level quality assessment for complex AI agents.
Principles
- Evaluate the evaluator itself.
- Reliability is as crucial as validity.
- Judge sensitivity varies by evaluator type.
Method
Microsoft Foundry assesses multi-turn evaluators using benchmark datasets, paired accuracy and reliability metrics, and multiple judge models per evaluator to separate evaluator performance from judge performance.
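To make the two metric families concrete, here is a minimal sketch in plain Python. It rests on assumptions the summary does not spell out: PR-AUC is computed as step-wise average precision, and flip rate is taken to mean the fraction of items whose verdict changes across repeated runs of the same judge. The function names, the judge names, and the toy data are hypothetical and are not results from the study.

```python
from typing import Sequence


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Step-wise area under the precision-recall curve (average precision).

    labels: 1 for a positive ground-truth item, 0 otherwise.
    scores: evaluator score per item; higher means more likely positive.
    Tied scores keep their input order, so avoid ties for exact comparisons.
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("need at least one positive label")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank  # precision at each recovered positive
    return precision_sum / total_positives


def flip_rate(runs: Sequence[Sequence[bool]]) -> float:
    """Fraction of items whose verdict is not identical across repeated runs."""
    if len(runs) < 2:
        raise ValueError("need at least two runs to measure flips")
    n_items = len(runs[0])
    if n_items == 0 or any(len(run) != n_items for run in runs):
        raise ValueError("every run must score the same non-empty item set")
    flipped = sum(1 for verdicts in zip(*runs) if len(set(verdicts)) > 1)
    return flipped / n_items


# Toy ground truth and scores from two hypothetical judges on the same items.
labels = [1, 0, 1, 1, 0, 0]
judge_scores = {
    "judge_a": [0.9, 0.8, 0.7, 0.4, 0.3, 0.1],
    "judge_b": [0.9, 0.2, 0.8, 0.7, 0.3, 0.1],
}
validity = {name: average_precision(labels, s) for name, s in judge_scores.items()}

# Spread across judges: a large gap suggests the judge, not the evaluator
# definition, is driving the score.
judge_spread = max(validity.values()) - min(validity.values())

# Three repeated runs of one judge over four items; only the third item flips.
repeat_runs = [
    [True, False, True, True],
    [True, False, False, True],
    [True, False, True, True],
]
reliability = flip_rate(repeat_runs)
```

Running the same evaluator under several judges and comparing `validity` per judge is one simple way to separate evaluator performance from judge performance, while `reliability` indicates how much of a score change between two runs could be noise.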
In practice
- Use single-turn for per-response checks.
- Apply multi-turn for session-level properties.
- Employ adaptive evaluators for bespoke agents.
Topics
- Multi-turn Agents
- LLM-as-Judge Evaluation
- Microsoft Foundry
- Agent Observability
- Task Completion
- Customer Satisfaction
- Groundedness
Best for: AI Architect, NLP Engineer, AI Scientist, AI Engineer, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published on the Microsoft Foundry Blog.