EVA-Bench Data 2.0: 3 Domains, 121 Tools, 213 Scenarios

2026-05-14 · Source: Hugging Face - Blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Advanced, medium

Summary

ServiceNow-AI has released EVA-Bench Data 2.0, a substantial expansion of its benchmark for voice agent evaluation. Published on June 4, 2026, this update broadens coverage from one enterprise domain to three: Airline Customer Service Management (CSM), Enterprise IT Service Management (ITSM), and Healthcare HR Service Delivery (HRSD). The benchmark now contains 213 evaluation scenarios across 121 tools, a roughly 4x increase in scenario coverage. Each scenario was validated for solvability using OpenAI GPT-5.4, Google Gemini 3.1 Pro, and Anthropic Claude Opus 4.6. Scenarios are generated with SyGra and GPT-5.4 so that user goals, initial databases, and ground truth are produced jointly and stay consistent with one another. The dataset is open-source and available for download, with a multilingual extension planned.

Key takeaway

For AI engineers evaluating voice agents for enterprise deployments, EVA-Bench Data 2.0 offers a larger, more rigorously validated benchmark. Use this open-source dataset to test your agents across 213 realistic scenarios in the Airline CSM, Enterprise ITSM, and Healthcare HRSD domains. Doing so tests whether your models handle complex, domain-specific challenges, including authentication and adversarial interactions, and gives a more reliable picture of real-world performance before deployment. Its generation methodology can also serve as a blueprint for creating your own high-quality evaluation data.

Key insights

Reproducible voice agent evaluation requires domain-specific, realistic, and varied scenarios with robust, jointly generated ground truth.

Principles

Focus on voice-first enterprise workflows.
Model tool schemas and policies for realism.
Ensure scenario variety, including adversarial cases.

Method

Scenarios are jointly generated using SyGra with GPT-5.4 for user goals, initial databases, and ground truth, then validated via structural, LLM-based, and trace verification checks, followed by manual review and frontier model testing.
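
As an illustration of the first validation stage, the sketch below shows what a structural check on a jointly generated scenario might look like. It is not the benchmark's actual validator: the Scenario and ToolSpec record layouts, the field names, and the structural_errors helper are all assumptions made for this example, and the real EVA-Bench schema may differ.

```python
from dataclasses import dataclass, field


@dataclass
class ToolSpec:
    # Hypothetical tool schema: a name plus the parameters a call must supply.
    name: str
    required_params: frozenset


@dataclass
class Scenario:
    # Hypothetical record layout; the real EVA-Bench schema may differ.
    scenario_id: str
    domain: str
    user_goal: str
    initial_db: dict
    expected_tool_calls: list = field(default_factory=list)


def structural_errors(scenario, tools):
    """Return a list of human-readable problems; an empty list means the check passed."""
    errors = []
    if not scenario.user_goal.strip():
        errors.append("user_goal is empty")
    if not scenario.initial_db:
        errors.append("initial_db is empty")
    # Every ground-truth call must name a known tool and supply its required arguments.
    for i, call in enumerate(scenario.expected_tool_calls):
        spec = tools.get(call.get("name"))
        if spec is None:
            errors.append(f"call {i}: unknown tool {call.get('name')!r}")
            continue
        missing = spec.required_params - set(call.get("args", {}))
        if missing:
            errors.append(f"call {i}: missing args {sorted(missing)}")
    return errors
```

A check like this only catches malformed records. Whether the ground truth is actually reachable from the initial database is a question for the later LLM-based and trace verification stages.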

In practice

Evaluate voice agents against diverse enterprise scenarios.
Use the generation process as a reference when building custom evaluation datasets.
Test agent performance on specific domains like ITSM or HRSD.
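
When testing per domain, it helps to report results per domain as well, so a weak domain is not hidden by a strong overall average. The sketch below assumes each evaluation run yields a (domain, passed) pair. The pass_rate_by_domain helper and the domain labels are hypothetical and are not part of the EVA-Bench tooling.

```python
from collections import defaultdict


def pass_rate_by_domain(results):
    """Map each domain to the fraction of its scenarios that passed.

    `results` is an iterable of (domain, passed) pairs.
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for domain, passed in results:
        totals[domain] += 1
        passes[domain] += int(bool(passed))
    return {domain: passes[domain] / totals[domain] for domain in totals}


if __name__ == "__main__":
    # Hypothetical outcomes, not real benchmark results.
    demo = [("csm", True), ("csm", False), ("itsm", True), ("hrsd", False)]
    for domain, rate in pass_rate_by_domain(demo).items():
        print(f"{domain}: {rate:.0%}")
```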

Topics

Voice Agents
Enterprise AI
Benchmark Datasets
Synthetic Data Generation
LLM Evaluation
Multilingual AI

Best for: AI Architect, NLP Engineer, AI Scientist, AI Engineer, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Hugging Face - Blog.