Evaluate before you ship: introducing the Voice Live Evaluation Harness

2026-06-03 · Source: Microsoft Foundry Blog articles · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Cloud Computing & IT Infrastructure · Depth: Intermediate, short

Summary

Microsoft has released the Voice Live Evaluation Harness, an open-source, deployable evaluation pipeline for Azure Voice Live agents. It goes beyond manual listening to assess voice agent quality systematically: it runs pre-recorded multi-turn audio through an agent and automatically scores each turn. It integrates 13 built-in evaluators, powered by Microsoft Foundry models such as GPT-4.1-mini and o4-mini, covering dimensions such as intent resolution, task adherence, and tool-call accuracy. The harness supports all three Voice Live modes (Semantic VAD, Push-to-Talk, and Foundry Agent mode), including conversations with tool calls and grounding. Available as a local CLI or a deployable Azure evaluation agent, it lets teams establish quality baselines, compare configurations, catch regressions, and optimize with data, with scores viewable in the Microsoft Foundry portal.

Key takeaway

If you are an MLOps engineer deploying or iterating on Azure Voice Live agents, integrate the Voice Live Evaluation Harness into your workflow. Doing so lets you establish objective quality baselines, compare agent configurations systematically, and catch regressions before they reach users. Use its 13 built-in evaluators and continuous feedback loop to optimize agent performance with data rather than subjective listening.

Key insights

Systematic, automated evaluation of voice agents with the Voice Live Evaluation Harness grounds quality decisions in data and helps catch regressions before release.

Principles

Voice agent evaluation needs real-time, data-driven metrics.
Continuous evaluation prevents regressions and optimizes performance.
Standardized evaluators enable consistent quality measurement.

Method

The pipeline takes an audio dataset (JSONL), streams it through the Voice Live API, captures the transcripts and responses, scores them with the 13 Foundry evaluators, and surfaces aggregate and per-turn scores in the Foundry portal.
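
The last step of that pipeline, rolling per-turn scores up into aggregates, can be sketched in a few lines of Python. This is a minimal illustration, not the harness's own code: the JSONL record layout (the `conversation_id`, `turn`, and `scores` fields) and the function name `aggregate_scores` are assumptions made for the example, and the evaluator names simply reuse two of the dimensions mentioned in the summary.

```python
import json
from collections import defaultdict
from statistics import mean


def aggregate_scores(jsonl_text):
    """Return the mean score per evaluator across all scored turns.

    Assumes (hypothetically) one JSON object per line, each with a
    "scores" mapping of evaluator name -> numeric score.
    """
    per_evaluator = defaultdict(list)
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        for name, score in record["scores"].items():
            per_evaluator[name].append(score)
    return {name: mean(values) for name, values in per_evaluator.items()}


# Two scored turns of one conversation, in the assumed record layout.
SAMPLE = "\n".join([
    json.dumps({"conversation_id": "c1", "turn": 1,
                "scores": {"intent_resolution": 4.0, "task_adherence": 5.0}}),
    json.dumps({"conversation_id": "c1", "turn": 2,
                "scores": {"intent_resolution": 2.0, "task_adherence": 3.0}}),
])

print(aggregate_scores(SAMPLE))
```

Keeping the per-turn records alongside the aggregate makes it possible to trace a low average back to the specific turn that caused it.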

In practice

Use the CLI harness for rapid local iteration.
Deploy the evaluation agent for hosted, long-running batches.
Integrate the harness into CI/CD to fail builds on quality drops.
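
A CI/CD quality gate of the kind described above might look like the following sketch. It assumes aggregate scores per evaluator are already available as plain dictionaries; the function names `find_regressions` and `gate_exit_code`, the `tolerance` value, and the example scores are all hypothetical and not part of the harness.

```python
def find_regressions(baseline, current, tolerance=0.25):
    """Return evaluators whose score dropped by more than `tolerance`.

    A missing evaluator in `current` is also treated as a regression,
    since a silently skipped metric should not pass the gate.
    """
    regressions = {}
    for name, base in baseline.items():
        score = current.get(name)
        if score is None or base - score > tolerance:
            regressions[name] = (base, score)
    return regressions


def gate_exit_code(baseline, current, tolerance=0.25):
    """Print any regressions and return a process exit code (0 = pass)."""
    regressions = find_regressions(baseline, current, tolerance)
    for name, (base, score) in sorted(regressions.items()):
        print(f"REGRESSION {name}: baseline={base} current={score}")
    return 1 if regressions else 0


# Illustrative numbers only: one evaluator holds steady, one drops.
baseline = {"intent_resolution": 4.2, "task_adherence": 4.0}
current = {"intent_resolution": 4.1, "task_adherence": 3.4}
code = gate_exit_code(baseline, current)
print("exit code:", code)
```

In a CI job, the returned value would be passed to `sys.exit` so that a non-zero code fails the build. The tolerance is there to absorb run-to-run noise in model-graded scores; the right value depends on how stable the evaluators are for a given dataset.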

Topics

Voice Live Evaluation Harness
Azure Voice Live
Voice Agents
Conversational AI Evaluation
Microsoft Foundry
MLOps

Code references

microsoft-foundry/voicelive-evaluation

Best for: NLP Engineer, AI Architect, AI Engineer, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Microsoft Foundry Blog articles.