From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents
Summary
A dataset-agnostic framework converts existing text-based tool-calling benchmarks into audio-based evaluations for large language model (LLM) agents. It uses text-to-speech, speaker variation, and environmental noise to generate paired text-audio instances, and it preserves the original dataset annotations, so tool schemas and gold labels do not need to be re-annotated. An evaluation of seven omni-modal models on audio-converted versions of the Confetti and When2Call benchmarks shows that performance depends strongly on both model and task. Gemini-3.1-Flash-Live achieved the highest Confetti score at 70.4, while GPT-Realtime-1.5 performed best on When2Call with 71.9. The text-to-voice performance gap on Confetti ranged from 1.8 points for Qwen3-Omni to 4.8 points for GPT-Realtime-1.5, with misunderstood argument values in speech identified as a primary failure mode. The framework also includes a reference-free LLM-as-judge protocol, validated against human preferences. Under this protocol, open-source Qwen3 judges with at least 8B parameters achieve over 80% agreement with proprietary judges.
Key takeaway
If you are an AI engineer building voice agents with tool-calling capabilities, integrate audio-based evaluation early in your development cycle. The observed performance gaps between text and voice, particularly in argument value parsing, indicate that text-only benchmarks are insufficient. Focus on improving your model's robustness to speech-induced ambiguities, and consider using the proposed framework for reproducible, verifiable diagnostics before real-world deployment.
Key insights
Converting text-based tool-calling benchmarks to audio reveals significant model- and task-dependent performance gaps.
Principles
- Speech-based tool calling introduces new failure modes.
- Performance varies widely across omni-modal models.
- LLM-as-judge can validate evaluation protocols.
Method
The framework converts text benchmarks to audio using text-to-speech, speaker variation, and environmental noise, preserving the original annotations. It then evaluates omni-modal models on the converted data and analyzes their failure cases. It also includes an ambiguity stress test and a reference-free LLM-as-judge protocol.
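The conversion step can be pictured with a minimal sketch. It is not the authors' implementation. The `PairedInstance` type, the `synthesize` callable (a stand-in for whatever text-to-speech system is used), the `"text"` field name, and the choice to control noise through a signal-to-noise ratio are all assumptions made for illustration. The sketch shows only how a text example could be paired with a noised waveform while every other annotation is carried over unchanged.

```python
import math
from dataclasses import dataclass
from typing import Callable


@dataclass
class PairedInstance:
    text: str            # original text query
    audio: list[float]   # synthesized waveform with noise mixed in
    annotations: dict    # tool schemas and gold labels, carried over unchanged


def mean_power(samples: list[float]) -> float:
    """Average squared amplitude of a waveform."""
    return sum(s * s for s in samples) / len(samples)


def mix_at_snr(speech: list[float], noise: list[float], snr_db: float) -> list[float]:
    """Add noise to speech, scaled so the mix has the requested SNR in dB."""
    if not speech or not noise:
        raise ValueError("speech and noise must be non-empty")
    # Loop the noise clip so it covers the whole utterance.
    tiled = [noise[i % len(noise)] for i in range(len(speech))]
    p_speech, p_noise = mean_power(speech), mean_power(tiled)
    if p_speech == 0 or p_noise == 0:
        return list(speech)  # nothing to scale against
    # SNR_dB = 10 * log10(P_speech / P_noise), solved for the noise gain.
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, tiled)]


def to_paired_instance(
    example: dict,
    synthesize: Callable[[str, str], list[float]],  # hypothetical TTS hook
    speaker: str,
    noise: list[float],
    snr_db: float,
) -> PairedInstance:
    """Build a text-audio pair without touching the original annotations."""
    clean = synthesize(example["text"], speaker)
    # Everything except the query text is treated as annotation (assumption).
    annotations = {k: v for k, v in example.items() if k != "text"}
    return PairedInstance(example["text"], mix_at_snr(clean, noise, snr_db), annotations)
```

Because the annotations are copied rather than regenerated, the same scoring code can in principle run on both the text and the audio version of each instance.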
In practice
- Evaluate omni-modal models on audio-converted benchmarks.
- Prioritize robust argument value parsing in speech.
- Consider open-source Qwen3 models with at least 8B parameters as LLM judges for privacy-preserving evaluation.
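The judge comparison in the summary is reported as an agreement percentage. The summary does not say how that figure is computed, so the sketch below assumes the simplest reading: the fraction of items on which two judges return the same verdict. The function name and the verdict labels are hypothetical.

```python
def agreement_rate(verdicts_a: list[str], verdicts_b: list[str]) -> float:
    """Fraction of items on which two judges give the same verdict."""
    if len(verdicts_a) != len(verdicts_b):
        raise ValueError("both judges must score the same items")
    if not verdicts_a:
        raise ValueError("need at least one item")
    matches = sum(a == b for a, b in zip(verdicts_a, verdicts_b))
    return matches / len(verdicts_a)


# Hypothetical verdicts from an open-source judge and a proprietary judge.
open_judge = ["pass", "fail", "pass", "pass"]
proprietary_judge = ["pass", "fail", "fail", "pass"]
print(f"agreement: {agreement_rate(open_judge, proprietary_judge):.0%}")
```

Raw agreement does not correct for agreement by chance, so a chance-corrected statistic may be worth reporting alongside it when verdicts are heavily skewed toward one label.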
Topics
- Tool Calling LLM Agents
- Audio-based Evaluation
- Text-to-Voice Conversion
- Omni-modal Models
- Performance Benchmarking
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.