From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents

2026-05-14 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Data Science & Analytics · Depth: Expert, quick

Summary

A new dataset-agnostic framework converts existing text-based tool-calling benchmarks into audio-based evaluations for large language model (LLM) agents. It uses text-to-speech, speaker variation, and environmental noise to generate paired text-audio instances, and it preserves the original dataset annotations, so tool schemas and gold labels need no re-annotation. An evaluation of 7 omni-modal models on audio-converted versions of the Confetti and When2Call benchmarks showed that results depend strongly on both the model and the task. Gemini-3.1-Flash-Live achieved the highest Confetti score at 70.4, while GPT-Realtime-1.5 performed best on When2Call with 71.9. The text-to-voice performance gap on Confetti ranged from 1.8 points for Qwen3-Omni to 4.8 points for GPT-Realtime-1.5, and misunderstood argument values in speech were identified as a primary failure mode. The framework also includes a reference-free LLM-as-judge protocol, validated against human preferences; open-source Qwen3 judges with at least 8B parameters achieve over 80% agreement with proprietary judges.

Key takeaway

AI engineers building voice agents with tool-calling capabilities should integrate audio-based evaluation early in the development cycle. The observed performance gaps between text and voice, particularly in argument value parsing, indicate that text-only benchmarks are insufficient. Focus on making the model robust to speech-induced ambiguities, and consider using the proposed framework for reproducible, verifiable diagnostics before real-world deployment.

Key insights

Converting text-based tool-calling benchmarks to audio reveals significant model- and task-dependent performance gaps.

Principles

Speech-based tool calling introduces new failure modes.
Performance varies widely across omni-modal models.
Reference-free LLM-as-judge evaluation can be validated against human preferences.
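
To make the first principle concrete, the sketch below separates an argument value error from other tool-call errors by comparing a predicted call with its gold label. The function name `diagnose_call`, the category labels, and the example call are hypothetical illustrations and are not taken from the paper or from the Confetti or When2Call benchmarks.

```python
from typing import Any, Dict


def diagnose_call(pred: Dict[str, Any], gold: Dict[str, Any]) -> str:
    """Classify a predicted tool call against its gold label.

    The category names are illustrative assumptions, not the paper's taxonomy.
    """
    if pred.get("name") != gold.get("name"):
        return "wrong_tool"
    pred_args = pred.get("arguments", {})
    gold_args = gold.get("arguments", {})
    if set(pred_args) != set(gold_args):
        return "argument_set_mismatch"
    for key, gold_value in gold_args.items():
        if pred_args[key] != gold_value:
            return "argument_value_mismatch"
    return "correct"


# Hypothetical example: the spoken time is misheard, the tool choice is right.
gold_call = {"name": "book_table", "arguments": {"party_size": 2, "time": "19:00"}}
voice_call = {"name": "book_table", "arguments": {"party_size": 2, "time": "17:00"}}

print(diagnose_call(voice_call, gold_call))  # argument_value_mismatch
```

Running the same check on the text and the voice prediction for each instance would show which errors appear only in the audio condition.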

Method

The framework converts text benchmarks to audio using text-to-speech, speaker variation, and environmental noise while preserving the original annotations. It is used to evaluate omni-modal models and analyze their failure cases, and it also includes an ambiguity stress test and an LLM-as-judge protocol.
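
The conversion step can be sketched as follows. This is a minimal illustration under stated assumptions, not the framework's implementation: `placeholder_tts` is a stand-in that emits a tone instead of calling a real text-to-speech engine, Gaussian noise stands in for recorded environmental noise, and all class, field, and function names are hypothetical. It shows only how the original instance is carried along unchanged while the audio is mixed at a chosen signal-to-noise ratio (SNR).

```python
import math
import random
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class TextInstance:
    user_text: str
    tool_schema: Dict[str, Any]
    gold_label: Dict[str, Any]


@dataclass
class PairedInstance:
    text: TextInstance   # original instance, carried over untouched
    audio: List[float]   # mono samples of the noisy utterance
    speaker: str
    snr_db: float


def placeholder_tts(text: str, speaker: str, sample_rate: int = 16000) -> List[float]:
    """Stand-in for a real TTS engine (assumption): a tone whose pitch depends
    on the speaker and whose length tracks the text (10 ms per character)."""
    freq = 120.0 if speaker == "speaker_a" else 220.0
    n_samples = max(1, len(text)) * (sample_rate // 100)
    return [0.5 * math.sin(2 * math.pi * freq * i / sample_rate) for i in range(n_samples)]


def mean_power(samples: List[float]) -> float:
    return sum(s * s for s in samples) / len(samples)


def mix_noise(speech: List[float], snr_db: float, rng: random.Random) -> List[float]:
    """Add Gaussian noise scaled so that speech power / noise power hits snr_db."""
    noise = [rng.gauss(0.0, 1.0) for _ in speech]
    scale = math.sqrt(mean_power(speech) / (mean_power(noise) * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(speech, noise)]


def to_paired(inst: TextInstance, speaker: str, snr_db: float, seed: int = 0) -> PairedInstance:
    """Build a paired text-audio instance; schema and gold label are reused as-is."""
    clean = placeholder_tts(inst.user_text, speaker)
    noisy = mix_noise(clean, snr_db, random.Random(seed))
    return PairedInstance(text=inst, audio=noisy, speaker=speaker, snr_db=snr_db)


# Hypothetical benchmark instance.
instance = TextInstance(
    user_text="Book a table for two at seven",
    tool_schema={"name": "book_table"},
    gold_label={"name": "book_table", "arguments": {"party_size": 2}},
)
paired = to_paired(instance, speaker="speaker_a", snr_db=10.0)
```

Because the paired instance holds a reference to the original one, the same scorer and gold labels apply to both the text and the audio condition. Fixing the seed keeps the noise reproducible across runs.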

In practice

Evaluate omni-modal models on audio-converted benchmarks.
Prioritize robust argument value parsing in speech.
Consider open-source Qwen3 judges with at least 8B parameters for privacy-preserving evaluation.
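
One simple way to check whether an open-source judge can replace a proprietary one is to measure how often their verdicts match on the same items. The sketch below assumes each judge emits one categorical verdict per item; the verdict lists are made-up placeholder data, not results from the paper, and plain percent agreement is an assumption about the metric.

```python
from typing import Sequence


def agreement_rate(judge_a: Sequence[str], judge_b: Sequence[str]) -> float:
    """Fraction of items on which two judges give the same verdict."""
    if len(judge_a) != len(judge_b):
        raise ValueError("Both judges must score the same items")
    if not judge_a:
        raise ValueError("At least one verdict is required")
    matches = sum(a == b for a, b in zip(judge_a, judge_b))
    return matches / len(judge_a)


# Made-up verdicts for ten items (placeholder data only).
open_source_judge = ["A", "B", "A", "A", "B", "tie", "A", "B", "A", "B"]
proprietary_judge = ["A", "B", "A", "B", "B", "tie", "A", "B", "A", "A"]

print(agreement_rate(open_source_judge, proprietary_judge))  # 0.8
```

Percent agreement does not correct for matches that occur by chance, so a chance-corrected statistic such as Cohen's kappa may be worth reporting alongside it.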

Topics

Tool Calling LLM Agents
Audio-based Evaluation
Text-to-Voice Conversion
Omni-modal Models
Performance Benchmarking

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.