EVA-Bench Data 2.0: 3 Domains, 121 Tools, 213 Scenarios
Summary
ServiceNow-AI has released EVA-Bench Data 2.0, an expansion of its benchmark for voice agent evaluation. Published on June 4, 2026, the update broadens coverage from one enterprise domain to three: Airline Customer Service Management (CSM), Enterprise IT Service Management (ITSM), and Healthcare HR Service Delivery (HRSD). The benchmark now comprises 213 evaluation scenarios across 121 tools, a roughly 4x increase in scenario coverage. Scenarios are generated using SyGra with GPT-5.4 so that user goals, initial databases, and ground truth stay jointly consistent. Each scenario was then validated for solvability using OpenAI GPT-5.4, Google Gemini 3.1 Pro, and Anthropic Claude Opus 4.6. The dataset is open-source and available for download, with a multilingual extension planned.
Key takeaway
For AI engineers evaluating voice agents for enterprise deployments, EVA-Bench Data 2.0 offers a larger benchmark whose scenarios have each been validated for solvability. Integrate this open-source dataset to test your agents across 213 realistic scenarios in the Airline CSM, Enterprise ITSM, and Healthcare HRSD domains. Doing so shows how well your models handle complex, domain-specific challenges, including authentication and adversarial interactions, and gives a more reliable assessment of real-world performance before deployment. Consider its generation methodology as a blueprint for creating your own high-quality evaluation data.
Key insights
Reproducible voice agent evaluation requires domain-specific, realistic, and varied scenarios with robust, jointly generated ground truth.
Principles
- Focus on voice-first enterprise workflows.
- Model tool schemas and policies for realism.
- Ensure scenario variety, including adversarial cases.
Method
For each scenario, the user goal, initial database, and ground truth are generated jointly using SyGra with GPT-5.4. The generated scenarios then pass through structural checks, LLM-based checks, and trace verification, followed by manual review and testing against frontier models.
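To make the validation stages more concrete, here is a minimal Python sketch of what a structural check and a trace verification check might look like. Everything in it is an assumption made for illustration: the scenario layout (`user_goal`, `initial_db`, `ground_truth`), the `rebook_flight` tool, and the reading of "trace verification" as replaying ground-truth tool calls against the initial database and comparing the result with the expected final state. It is not the EVA-Bench schema or the SyGra API.

```python
from copy import deepcopy

# Hypothetical tool for illustration only; not one of the benchmark's 121 tools.
def rebook_flight(db, booking_id, new_flight):
    db["bookings"][booking_id]["flight"] = new_flight

TOOLS = {"rebook_flight": rebook_flight}

# Assumed top-level fields of a scenario record (not the real schema).
REQUIRED_KEYS = {"user_goal", "initial_db", "ground_truth"}

def structural_check(scenario):
    """Return a list of structural problems; an empty list means none were found."""
    problems = [f"missing key: {k}" for k in sorted(REQUIRED_KEYS - set(scenario))]
    if problems:
        return problems
    for call in scenario["ground_truth"]["tool_calls"]:
        if call["tool"] not in TOOLS:
            problems.append(f"unknown tool: {call['tool']}")
    return problems

def trace_check(scenario):
    """Replay the ground-truth tool calls on a copy of the initial database
    and report whether the result matches the expected final state."""
    db = deepcopy(scenario["initial_db"])
    for call in scenario["ground_truth"]["tool_calls"]:
        TOOLS[call["tool"]](db, **call["args"])
    return db == scenario["ground_truth"]["final_db"]

# Made-up scenario record, for illustration only.
scenario = {
    "user_goal": "Move booking B1 to flight EV202.",
    "initial_db": {"bookings": {"B1": {"flight": "EV101"}}},
    "ground_truth": {
        "tool_calls": [
            {"tool": "rebook_flight",
             "args": {"booking_id": "B1", "new_flight": "EV202"}},
        ],
        "final_db": {"bookings": {"B1": {"flight": "EV202"}}},
    },
}

print(structural_check(scenario))  # []
print(trace_check(scenario))       # True
```

Checking the ground truth by replay, rather than by inspection, is one way to catch a scenario whose expected outcome cannot actually be reached from its initial database.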
In practice
- Evaluate voice agents against diverse enterprise scenarios.
- Use the generation process as a reference when building custom evaluation datasets.
- Test agent performance on specific domains like ITSM or HRSD.
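For the per-domain testing mentioned above, a small Python sketch of how pass rates might be broken down by domain. The result records, the domain labels (`airline_csm`, `itsm`, `hrsd`), and the `passed` field are hypothetical placeholders, and the values are made up; none of them come from the dataset or from any published results.

```python
from collections import defaultdict

def pass_rate_by_domain(results):
    """Group scenario outcomes by domain and return the fraction passed per domain."""
    totals = defaultdict(lambda: [0, 0])  # domain -> [passed, total]
    for r in results:
        totals[r["domain"]][0] += int(r["passed"])
        totals[r["domain"]][1] += 1
    return {domain: passed / total for domain, (passed, total) in totals.items()}

# Made-up outcomes for illustration only; not real benchmark results.
example_results = [
    {"domain": "airline_csm", "passed": True},
    {"domain": "itsm", "passed": True},
    {"domain": "itsm", "passed": False},
    {"domain": "hrsd", "passed": True},
]

for domain, rate in pass_rate_by_domain(example_results).items():
    print(f"{domain}: {rate:.0%}")
```

Reporting per-domain rates alongside an overall score keeps a weak domain from being hidden by strong results in the others.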
Topics
- Voice Agents
- Enterprise AI
- Benchmark Datasets
- Synthetic Data Generation
- LLM Evaluation
- Multilingual AI
Best for: AI Architect, NLP Engineer, AI Scientist, AI Engineer, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Hugging Face - Blog.