Evaluate before you ship: introducing the Voice Live Evaluation Harness
Summary
Microsoft has released the Voice Live Evaluation Harness, an open-source, deployable evaluation pipeline for Azure Voice Live agents. It replaces manual listening as the way to assess voice agent quality: the harness runs pre-recorded multi-turn audio through an agent and automatically scores each turn. It integrates 13 built-in evaluators, powered by Microsoft Foundry models such as GPT-4.1-mini and o4-mini, that cover dimensions including intent resolution, task adherence, and tool-call accuracy. The harness supports all three Voice Live modes (Semantic VAD, Push-to-Talk, and Foundry Agent mode), including conversations with tool calls and grounding. It is available as a local CLI or as a deployable Azure evaluation agent. Teams can use it to establish quality baselines, compare configurations, catch regressions, and optimize agents based on data, with scores viewable in the Microsoft Foundry portal.
Key takeaway
MLOps engineers who deploy or iterate on Azure Voice Live agents should integrate the Voice Live Evaluation Harness into their workflow. Doing so lets them establish objective quality baselines, compare agent configurations systematically, and catch regressions before they reach users. The 13 built-in evaluators and a continuous feedback loop make it possible to optimize agent performance with data rather than subjective listening.
Key insights
Systematic, automated evaluation with the Voice Live Evaluation Harness replaces subjective listening with measurable per-turn scores and catches regressions before release.
Principles
- Voice agent evaluation needs real-time, data-driven metrics.
- Continuous evaluation prevents regressions and optimizes performance.
- Standardized evaluators enable consistent quality measurement.
Method
The pipeline has five stages: an audio dataset described in JSONL is streamed through the Voice Live API, the harness captures the transcripts and agent responses, the 13 Foundry evaluators score each turn, and the aggregate and per-turn scores are then viewed in the Foundry portal.
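To make the relationship between per-turn and aggregate scores concrete, here is a minimal sketch of how per-turn results could be rolled up into one mean per evaluator. The record layout (`conversation_id`, `turn`, `scores`) and the numeric scale are assumptions made for illustration and are not the harness's documented output format; only the evaluator names come from the article.

```python
import json
from collections import defaultdict
from statistics import mean


def aggregate_scores(jsonl_lines):
    """Return the mean score per evaluator across all turns."""
    per_evaluator = defaultdict(list)
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the file
        record = json.loads(line)
        # Hypothetical schema: each record carries a "scores" mapping
        # of evaluator name -> numeric score for that turn.
        for name, value in record["scores"].items():
            per_evaluator[name].append(float(value))
    return {name: mean(values) for name, values in per_evaluator.items()}


# Hypothetical per-turn results, one JSON object per line.
sample = [
    '{"conversation_id": "c1", "turn": 1, "scores": {"intent_resolution": 4, "task_adherence": 5}}',
    '{"conversation_id": "c1", "turn": 2, "scores": {"intent_resolution": 2, "task_adherence": 5}}',
]
summary = aggregate_scores(sample)
print(summary)
```

In this sketch the second turn pulls the `intent_resolution` mean down even though `task_adherence` stays high, which is the kind of detail that per-turn scoring exposes and a single overall impression would hide.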
In practice
- Use the CLI harness for rapid local iteration.
- Deploy the evaluation agent for hosted, long-running batches.
- Integrate the harness into CI/CD to fail builds when quality drops.
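The CI/CD point can be sketched as a small quality gate that compares the current run's aggregate scores against a stored baseline. Everything here is illustrative: the function names `find_regressions` and `gate`, the `tolerance` value, and the example scores are hypothetical, and the sketch assumes the aggregate scores have already been exported from an evaluation run as a plain mapping of evaluator name to mean score.

```python
import sys


def find_regressions(baseline, current, tolerance=0.25):
    """Return {metric: (baseline, current)} for metrics that dropped
    by more than `tolerance`, or that are missing from the current run."""
    regressions = {}
    for metric, base_value in baseline.items():
        new_value = current.get(metric)
        if new_value is None or base_value - new_value > tolerance:
            regressions[metric] = (base_value, new_value)
    return regressions


def gate(baseline, current, tolerance=0.25):
    """Return a process exit code: 0 if quality held, 1 otherwise."""
    regressions = find_regressions(baseline, current, tolerance)
    for metric, (old, new) in sorted(regressions.items()):
        print(f"REGRESSION {metric}: baseline={old} current={new}", file=sys.stderr)
    return 1 if regressions else 0


# Hypothetical aggregate scores for a baseline and a candidate build.
baseline = {"intent_resolution": 4.2, "task_adherence": 4.5}
current = {"intent_resolution": 4.3, "task_adherence": 3.9}
exit_code = gate(baseline, current)
# In a CI job, the script would end with sys.exit(exit_code)
# so that a non-zero code fails the build.
```

A tolerance band keeps small run-to-run variation in model-graded scores from failing builds, while a missing metric is treated as a regression so that a silently skipped evaluator does not pass the gate.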
Topics
- Voice Live Evaluation Harness
- Azure Voice Live
- Voice Agents
- Conversational AI Evaluation
- Microsoft Foundry
- MLOps
Best for: NLP Engineer, AI Architect, AI Engineer, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Microsoft Foundry Blog articles.