IHBench: Evaluating Post-Interruption Recovery in Voice Agents with Structured Workflows

Source: Artificial Intelligence · Field: Technology & Digital: Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

IHBench (Interruption Handling Benchmark) evaluates post-interruption recovery in voice agents operating within structured, multi-step workflows across 10 enterprise domains. Unlike existing benchmarks, which focus on interruption timing, IHBench assesses whether an agent resumes the workflow correctly, addresses the user's interjection, and avoids repeating content after an interruption. It injects six distinct interruption types at controlled points mid-utterance and scores agents on task fulfillment and recovery quality. The benchmark evaluated 27 audio-language model configurations from OpenAI, Google, and the open-weight community. Results vary widely, and recovery quality is strongly tied to interruption type. Closed-weight models were consistently more robust: they won more often on task fulfillment, degraded approximately 3.3x slower in longer conversations, and showed no audio-versus-text modality gap, whereas open-weight models did show one.

Key takeaway

If you are a machine learning engineer deploying voice agents in structured customer service or healthcare workflows, prioritize evaluating post-interruption recovery, because current models vary significantly. Your choice of model directly affects robustness: in IHBench, closed-weight models consistently handled interruptions and maintained workflow progress better than open-weight models. Integrate IHBench-like evaluations into your development cycle to confirm that agents resume tasks and address user interjections without repeating information.

Key insights

Voice agents need robust post-interruption recovery, a capability distinct from barge-in detection, with closed-weight models currently outperforming open-weight ones.

Method

IHBench evaluates voice agents by injecting six interruption types into state-machine-driven workflows across 10 enterprise domains, scoring task fulfillment and recovery quality using LLM judges.
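
The method above can be pictured with a small sketch. This is not IHBench's actual code: the names `InterruptedTurn`, `inject_interruption`, `RecoveryVerdict`, and `recovery_score` are hypothetical, the six interruption types are placeholder labels because this summary does not name them, and the equal weighting of the three recovery checks is an assumption. In the benchmark the verdict comes from an LLM judge; here it is filled in by hand.

```python
from dataclasses import dataclass

# Placeholder labels only: the summary states there are six interruption
# types but does not name them.
INTERRUPTION_TYPES = [f"type_{i}" for i in range(1, 7)]


@dataclass
class InterruptedTurn:
    step_index: int         # workflow step the agent was speaking
    spoken_prefix: str      # words the agent got out before the cut
    unspoken_suffix: str    # words the agent never delivered
    interruption_type: str  # which of the six types was injected
    user_interjection: str  # what the user said when interrupting


def inject_interruption(steps, step_index, cut_fraction,
                        interruption_type, user_interjection):
    """Cut one step's agent utterance at a controlled word offset."""
    if interruption_type not in INTERRUPTION_TYPES:
        raise ValueError(f"unknown interruption type: {interruption_type}")
    if not 0.0 < cut_fraction < 1.0:
        raise ValueError("cut_fraction must be strictly between 0 and 1")
    words = steps[step_index].split()
    if len(words) < 2:
        raise ValueError("utterance too short to interrupt mid-utterance")
    # Clamp so at least one word is spoken and at least one is left unspoken.
    cut = max(1, min(len(words) - 1, int(len(words) * cut_fraction)))
    return InterruptedTurn(
        step_index=step_index,
        spoken_prefix=" ".join(words[:cut]),
        unspoken_suffix=" ".join(words[cut:]),
        interruption_type=interruption_type,
        user_interjection=user_interjection,
    )


@dataclass
class RecoveryVerdict:
    # The three recovery behaviours named in the summary.
    resumed_workflow: bool
    addressed_interjection: bool
    avoided_repetition: bool


def recovery_score(verdict):
    """Fraction of recovery checks passed (equal weighting is an assumption)."""
    checks = [
        verdict.resumed_workflow,
        verdict.addressed_interjection,
        verdict.avoided_repetition,
    ]
    return sum(checks) / len(checks)


if __name__ == "__main__":
    workflow = [
        "Please confirm your date of birth and the last four digits of your account number",
    ]
    turn = inject_interruption(workflow, 0, 0.5, "type_1", "Wait, which account?")
    print(turn.spoken_prefix)
    print(recovery_score(RecoveryVerdict(True, True, False)))
```

Keeping the spoken prefix and the unspoken suffix separate is what lets a judge tell the difference between an agent that resumes where it stopped and one that repeats what it had already said.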

Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.