IHBench: Evaluating Post-Interruption Recovery in Voice Agents with Structured Workflows
Summary
IHBench (Interruption Handling Benchmark) evaluates post-interruption recovery in voice agents operating within structured, multi-step workflows across 10 enterprise domains. Unlike existing benchmarks that focus on interruption timing, IHBench assesses whether an agent resumes the workflow correctly, addresses the user's interjection, and avoids repeating content after an interruption. It injects six distinct interruption types at controlled points mid-utterance and scores agents on task fulfillment and recovery quality. The benchmark was used to evaluate 27 audio-language model configurations from OpenAI, Google, and the open-weight community. Results vary widely, and recovery quality is strongly tied to interruption type. Closed-weight models were consistently more robust: they won more often on task fulfillment, degraded approximately 3.3x slower in longer conversations, and showed no audio-versus-text modality gap, unlike open-weight models.
Key takeaway
Machine learning engineers deploying voice agents in structured customer service or healthcare workflows should prioritize evaluating post-interruption recovery, because current models vary significantly. Model choice directly affects robustness: in this benchmark, closed-weight models consistently handled interruptions and maintained workflow progress better than open-weight models. Integrate IHBench-like evaluations into your development cycle to ensure agents can resume tasks and address user interjections without repeating information.
Key insights
Voice agents need robust post-interruption recovery, a capability distinct from barge-in detection, with closed-weight models currently outperforming open-weight ones.
Principles
- Post-interruption recovery is a distinct agent capability.
- Closed-weight models show superior interruption robustness.
- Recovery quality varies significantly by interruption type.
Method
IHBench evaluates voice agents by injecting six interruption types into state-machine-driven workflows across 10 enterprise domains, scoring task fulfillment and recovery quality using LLM judges.
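To make the setup concrete, here is a minimal Python sketch of injecting an interruption at a controlled point mid-utterance in a state-machine-driven workflow. It is an illustration under stated assumptions, not IHBench's actual harness: the names `Workflow`, `InterruptionEvent`, `truncate_mid_utterance`, and `inject`, the word-level cut point, and the `kind` label are all hypothetical and do not come from the benchmark.

```python
from dataclasses import dataclass


@dataclass
class Workflow:
    """Hypothetical linear state machine: each step is one agent utterance."""
    steps: list[str]
    position: int = 0

    def current_utterance(self) -> str:
        return self.steps[self.position]

    def advance(self) -> None:
        # Move to the next step, staying on the last one once it is reached.
        if self.position < len(self.steps) - 1:
            self.position += 1


@dataclass
class InterruptionEvent:
    step_index: int    # workflow step that was interrupted
    heard: str         # portion of the utterance delivered before the cut
    interjection: str  # what the simulated user said
    kind: str          # hypothetical interruption-type label


def truncate_mid_utterance(utterance: str, fraction: float) -> str:
    """Return the words delivered before an interruption at `fraction`."""
    words = utterance.split()
    cut = max(1, int(len(words) * fraction))
    return " ".join(words[:cut])


def inject(workflow: Workflow, fraction: float,
           interjection: str, kind: str) -> InterruptionEvent:
    """Cut the current utterance at a controlled point and record the event."""
    heard = truncate_mid_utterance(workflow.current_utterance(), fraction)
    return InterruptionEvent(workflow.position, heard, interjection, kind)
```

An evaluation loop built on this sketch would pass the `InterruptionEvent` and the dialogue history to the agent under test, then hand the agent's next turn to a judge that checks whether it addressed the interjection, resumed at `step_index`, and avoided repeating the `heard` portion.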
In practice
- Benchmark voice agents for post-interruption recovery.
- Prioritize closed-weight models for robust agent performance.
- Design workflows considering diverse interruption types.
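Because recovery quality depends on interruption type, an in-house evaluation could report scores per type rather than as a single average. The sketch below assumes each judged trial is recorded as a `(kind, score)` pair; the function names and the record format are hypothetical, not part of IHBench.

```python
from collections import defaultdict
from statistics import mean


def recovery_by_type(records: list[tuple[str, float]]) -> dict[str, float]:
    """Average the recovery score for each interruption-type label."""
    grouped: dict[str, list[float]] = defaultdict(list)
    for kind, score in records:
        grouped[kind].append(score)
    return {kind: mean(scores) for kind, scores in grouped.items()}


def weakest_type(records: list[tuple[str, float]]) -> str:
    """Return the interruption type with the lowest average recovery score."""
    averages = recovery_by_type(records)
    return min(averages, key=averages.get)
```

Reporting the weakest type alongside the overall mean highlights which interruption scenarios a workflow design should handle first.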
Topics
- Voice Agents
- Interruption Handling
- Workflow Automation
- LLM Evaluation
- Benchmark Development
- Closed-weight Models
Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.