EnterpriseBench: CoreCraft – Measuring AI Agents in Chaotic, Enterprise RL Environments

Source: Surge AI Blog · Field: Technology & Digital, Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Advanced, long

Summary

Surge AI has launched EnterpriseBench, a suite of reinforcement learning (RL) environment benchmarks that evaluates AI agents on complex, high-value enterprise job functions. Its first environment, CoreCraft, simulates a high-growth computer hardware startup: agents must navigate more than 2,500 entities, parse noisy Slack data, audit shipping manifests against SLAs, and negotiate refunds while adhering to company policy. In initial evaluations, frontier models such as Claude Opus 4.6 and GPT-5.2 solve only 30-40% of problems. Common failures include hallucinating refunds, getting stuck in logic loops, and leaking PII. Training a GLM 4.6 model on CoreCraft data improved its in-distribution performance by 11.39 percentage points, and the gains generalized to the external benchmarks BFCL, Tau2-Bench, and Toolathlon, with Toolathlon showing a 6.8 percentage point increase in Pass@1.
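The summary reports gains in percentage points and in Pass@1, which are easy to confuse with relative improvements. The sketch below illustrates the arithmetic only. It assumes one attempt per task, so that Pass@1 is simply the share of tasks solved on the first try; the source does not state how Toolathlon computes the metric, and the outcome lists are made-up illustration data, not results from the article.

```python
from typing import Sequence


def pass_at_1(first_attempt_success: Sequence[bool]) -> float:
    """Share of tasks solved on the first attempt, as a percentage.

    Assumes exactly one attempt per task (an assumption of this sketch).
    """
    if not first_attempt_success:
        raise ValueError("need at least one task outcome")
    return 100.0 * sum(first_attempt_success) / len(first_attempt_success)


def percentage_point_gain(before_pct: float, after_pct: float) -> float:
    """Absolute difference between two percentages (what the summary reports)."""
    return after_pct - before_pct


def relative_gain(before_pct: float, after_pct: float) -> float:
    """Relative change, in percent of the baseline (not what the summary reports)."""
    if before_pct == 0:
        raise ValueError("relative gain is undefined for a zero baseline")
    return 100.0 * (after_pct - before_pct) / before_pct


# Hypothetical first-attempt outcomes for ten tasks, before and after training.
baseline_outcomes = [True, False, False, False, True, False, False, False, False, False]
trained_outcomes = [True, True, False, False, True, False, False, False, False, False]

baseline_pct = pass_at_1(baseline_outcomes)
trained_pct = pass_at_1(trained_outcomes)

print(f"baseline Pass@1: {baseline_pct:.1f}%")
print(f"trained Pass@1:  {trained_pct:.1f}%")
print(f"gain: {percentage_point_gain(baseline_pct, trained_pct):.1f} percentage points")
print(f"relative gain: {relative_gain(baseline_pct, trained_pct):.1f}% of baseline")
```

The same absolute gain corresponds to very different relative gains depending on the baseline, which is why the reported figures should be read as percentage points.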

Key takeaway

For AI Scientists and Research Scientists developing autonomous agents, this research shows that current frontier models struggle with real-world enterprise complexity, policy adherence, and tool orchestration. Focus on improving agentic capabilities such as active data discovery, persistent memory management, and flexible search strategies, which help avoid common failures like hallucinated data and unproductive loops. Consider using benchmarks like CoreCraft to test and train models for generalizable, reliable performance in dynamic environments.

Key insights

EnterpriseBench and CoreCraft evaluate AI agents on complex, realistic enterprise tasks, revealing significant limitations in current frontier models.

Method

EnterpriseBench uses RL environments such as CoreCraft, which features 2,500+ entities, 14 entity types, and 23 tools, to test agent capabilities beyond simple question-answering. The focus is on long-horizon, domain-specific tasks.
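To make the idea of an environment built from typed entities and callable tools concrete, here is a minimal sketch. It is not CoreCraft's implementation or API: the class `ToyEnvironment`, the tools `lookup_order` and `issue_refund`, the entity fields, and the refund rule are all hypothetical names and assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple


@dataclass
class Entity:
    entity_id: str
    entity_type: str            # hypothetical type label, e.g. "order"
    fields: Dict[str, Any]


@dataclass
class ToyEnvironment:
    """Hypothetical stand-in for an enterprise RL environment."""
    entities: Dict[str, Entity] = field(default_factory=dict)
    tools: Dict[str, Callable[..., Any]] = field(default_factory=dict)
    call_log: List[Tuple[str, Dict[str, Any]]] = field(default_factory=list)

    def add_entity(self, entity: Entity) -> None:
        self.entities[entity.entity_id] = entity

    def register_tool(self, name: str, fn: Callable[..., Any]) -> None:
        self.tools[name] = fn

    def call(self, name: str, **kwargs: Any) -> Any:
        """Entry point an agent would use; every call is logged for later grading."""
        if name not in self.tools:
            raise KeyError(f"unknown tool: {name}")
        self.call_log.append((name, kwargs))
        return self.tools[name](self, **kwargs)


def lookup_order(env: ToyEnvironment, order_id: str) -> Dict[str, Any]:
    entity = env.entities.get(order_id)
    if entity is None or entity.entity_type != "order":
        return {"error": "order not found"}
    return dict(entity.fields)


def issue_refund(env: ToyEnvironment, order_id: str, amount: float) -> Dict[str, Any]:
    # Assumed policy for this sketch: a refund may not exceed the order total.
    order = lookup_order(env, order_id)
    if "error" in order:
        return order
    if amount > order["total"]:
        return {"error": "refund exceeds order total"}
    return {"refunded": amount, "order_id": order_id}


env = ToyEnvironment()
env.add_entity(Entity("ord-1", "order", {"total": 120.0, "status": "delivered"}))
env.register_tool("lookup_order", lookup_order)
env.register_tool("issue_refund", issue_refund)

print(env.call("lookup_order", order_id="ord-1"))
print(env.call("issue_refund", order_id="ord-1", amount=500.0))
```

Because the environment, not the agent, holds the ground truth, a refund for a nonexistent order or an amount above the order total is rejected rather than silently accepted, and the call log gives a grader a trace of what the agent actually did.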

Best for: AI Scientist, Research Scientist, AI Engineer, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Surge AI Blog.