EnterpriseBench: CoreCraft – Measuring AI Agents in Chaotic, Enterprise RL Environments
Summary
Surge AI has launched EnterpriseBench, a new suite of reinforcement learning (RL) environment benchmarks designed to evaluate AI agents on complex, high-value enterprise job functions. Its first environment, CoreCraft, simulates a high-growth computer hardware startup in which agents must navigate over 2,500 entities, parse noisy Slack data, audit shipping manifests against SLAs, and negotiate refunds while adhering to company policy. In initial evaluations, frontier models such as Claude Opus 4.6 and GPT-5.2 solve only 30-40% of problems, often failing by hallucinating refunds, getting stuck in logic loops, or leaking PII. Training a GLM 4.6 model on CoreCraft data improved its in-distribution performance by 11.39 percentage points, and the gains generalized to the external benchmarks BFCL, Tau2-Bench, and Toolathlon, with Toolathlon Pass@1 rising by 6.8 percentage points.
Key takeaway
For AI Scientists and Research Scientists developing autonomous agents, this research highlights that current frontier models struggle significantly with real-world enterprise complexity, policy adherence, and tool orchestration. You should focus on improving agentic capabilities like active data discovery, persistent memory management, and flexible search strategies to avoid common failures such as hallucinating data or getting stuck in unproductive loops. Consider using benchmarks like CoreCraft to rigorously test and train models for generalizable, reliable performance in dynamic environments.
Key insights
EnterpriseBench and CoreCraft evaluate AI agents on complex, realistic enterprise tasks, revealing significant limitations in current frontier models.
Principles
- Real-world agents require active discovery and persistence.
- Policy adherence is critical in enterprise environments.
- Sterile RL environments hinder true AI autonomy.
Method
EnterpriseBench uses RL environments such as CoreCraft to test agent capabilities beyond simple question-answering, focusing on long-horizon, domain-specific tasks. CoreCraft itself features 2,500+ entities, 14 entity types, and 23 tools.
In practice
- Test agents on dynamic, large-scale environments.
- Prioritize agent capabilities for active data discovery.
- Implement robust error handling for tool calls.
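As an illustration of the last point, the following is a minimal sketch of a tool-call wrapper, not anything taken from CoreCraft or the original article. It assumes tools are plain Python callables, retries only transient failures a bounded number of times, surfaces permanent failures instead of letting the agent guess a result, and blocks identical repeated calls as a simple guard against unproductive loops. The names `call_tool`, `ToolCallError`, and `lookup_order` are hypothetical.

```python
import time
from collections import Counter
from typing import Any, Callable


class ToolCallError(Exception):
    """Raised when a tool call cannot be completed safely."""


def call_tool(
    tool: Callable[..., Any],
    args: dict,
    history: Counter,
    max_retries: int = 3,
    max_repeats: int = 2,
    backoff_s: float = 0.0,
) -> Any:
    """Call `tool(**args)` with bounded retries and a repeated-call guard."""
    # Loop guard: count identical (tool, arguments) calls and refuse once
    # the same call has already been issued `max_repeats` times.
    key = (tool.__name__, repr(sorted(args.items())))
    history[key] += 1
    if history[key] > max_repeats:
        raise ToolCallError(f"repeated call blocked: {tool.__name__}")

    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            return tool(**args)
        except TimeoutError as exc:
            # Treated as transient here (an assumption): wait, then retry.
            last_error = exc
            time.sleep(backoff_s * attempt)
        except (KeyError, ValueError) as exc:
            # Treated as permanent: report the failure rather than
            # inventing a result.
            raise ToolCallError(f"{tool.__name__} rejected {args}: {exc}") from exc
    raise ToolCallError(
        f"{tool.__name__} failed after {max_retries} attempts"
    ) from last_error


def lookup_order(order_id: str) -> dict:
    """Hypothetical tool: returns an order record or raises KeyError."""
    orders = {"A-1": {"status": "shipped"}}
    return orders[order_id]


history = Counter()
result = call_tool(lookup_order, {"order_id": "A-1"}, history)
```

Keeping the call history outside the function lets one `Counter` span a whole episode, so the guard sees repeats across steps rather than only within a single call. Which exceptions count as transient is a design choice that depends on the tools in use.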
Topics
- AI Agent Benchmarking
- Reinforcement Learning
- Large Language Models
- Enterprise AI
- Tool Use
Best for: AI Scientist, Research Scientist, AI Engineer, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Surge AI Blog.