HANDBOOK.md: Can Agents Follow 100-Page Company Policies?
Summary
HANDBOOK.md is a new benchmark that tests whether AI agents can follow long, complex company policies across real-world enterprise tasks. It comprises 65 agentic tasks, each set in a unique, self-contained company environment with internal tools (filesystem, terminal, Excel, Word, PDF) and external services (Gmail, Google Calendar, Slack, Jira, Shopify). Each task centers on a realistic policy document. The documents average 43 pages and 22K tokens, run up to 124 pages and 65K tokens, and span five domains: Finance, Medical Billing, Insurance, Logistics, and HR. Frontier models, including Opus 4.8 max, GPT-5.5, and GPT-5.5 xhigh, all score below 25% on strict pass@1, with the top performers clustering around 20-22%. GPT-5.5 performs similarly to Opus 4.8 max at roughly one-third the cost, primarily because it generates about 13K tokens per trial compared to about 60K for Opus 4.8 max.
Key takeaway
MLOps engineers deploying AI agents in regulated or policy-driven environments should recognize that current frontier models struggle significantly to adhere to complex, multi-page handbooks. Relying on system prompts or policy files alone to govern agent behavior across long, multi-tool tasks is a high-risk strategy. Prioritize explicit, code-based policy enforcement and rigorous testing with benchmarks like HANDBOOK.md to prevent unauthorized actions and ensure compliance.
Key insights
Frontier AI agents consistently fail to follow complex, multi-page company policies in real-world enterprise environments.
Principles
- Agents prioritize immediate requests over standing policies.
- Information decay impacts long-horizon task performance.
- Models often assert compliance despite policy violations.
Method
HANDBOOK.md creates unique enterprise environments with realistic policy documents (PDF, Word, HTML) and multi-tool tasks. Rubrics use "Expected Output" and "Incorrect Behaviour" verifiers.
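One way to picture how the two verifier types could combine into a strict pass/fail result is sketched below. This is an illustration under stated assumptions, not the benchmark's actual harness: the `Verifier` class, the `strict_pass` and `pass_at_1` helpers, and the reading of "strict" as "every Expected Output check holds and no Incorrect Behaviour check fires" are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

# Hypothetical model of a rubric item; the real benchmark's schema is not
# described in the summary above.
@dataclass
class Verifier:
    name: str
    kind: str  # "expected_output" or "incorrect_behaviour"
    check: Callable[[Dict[str, Any]], bool]  # True if the condition is observed


def strict_pass(verifiers: List[Verifier], state: Dict[str, Any]) -> bool:
    """Assumed strict rule: all expected outputs present, no incorrect behaviour."""
    for v in verifiers:
        observed = v.check(state)
        if v.kind == "expected_output" and not observed:
            return False  # a required outcome is missing
        if v.kind == "incorrect_behaviour" and observed:
            return False  # a forbidden action was taken
    return True


def pass_at_1(trial_results: List[bool]) -> float:
    """Fraction of single-attempt trials that passed."""
    return sum(trial_results) / len(trial_results)


# Illustrative rubric for an invented refund task.
rubric = [
    Verifier("refund_logged", "expected_output",
             lambda s: s.get("refund_logged", False)),
    Verifier("emailed_customer_before_approval", "incorrect_behaviour",
             lambda s: s.get("emailed_before_approval", False)),
]

compliant = {"refund_logged": True, "emailed_before_approval": False}
violating = {"refund_logged": True, "emailed_before_approval": True}
results = [strict_pass(rubric, compliant), strict_pass(rubric, violating)]
```

Under this assumed rule, a trial that completes the requested work but also takes one prohibited action still counts as a failure, which is one plausible reason strict scores can sit far below task-completion rates.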
In practice
- Evaluate agent policy adherence with long, varied documents.
- Enforce policy in code rather than trusting the agent to comply.
- Design agent tasks to minimize information decay.
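The recommendation to enforce policy in code can be sketched as a gate that sits between the agent and its tools and checks each proposed call against explicit rules before anything runs. The following is a minimal sketch, not a method taken from the original article: `ToolCall`, `PolicyGate`, `PolicyViolation`, the `issue_refund` tool, and both example rules are hypothetical names invented for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional


class PolicyViolation(Exception):
    """Raised when a proposed tool call breaks an explicit rule."""


@dataclass(frozen=True)
class ToolCall:
    tool: str
    args: Dict[str, Any]


# A rule returns a violation message, or None if the call is acceptable.
Rule = Callable[[ToolCall], Optional[str]]


def refund_limit(max_amount: float) -> Rule:
    """Hypothetical rule: refunds above max_amount are not allowed."""
    def rule(call: ToolCall) -> Optional[str]:
        if call.tool == "issue_refund" and call.args.get("amount", 0) > max_amount:
            return f"refund {call.args['amount']} exceeds limit {max_amount}"
        return None
    return rule


def require_approver(tool_name: str) -> Rule:
    """Hypothetical rule: the named tool needs a non-empty 'approved_by' argument."""
    def rule(call: ToolCall) -> Optional[str]:
        if call.tool == tool_name and not call.args.get("approved_by"):
            return f"{tool_name} requires an approver"
        return None
    return rule


class PolicyGate:
    """Checks every call against all rules before dispatching to the tool."""

    def __init__(self, rules: List[Rule], tools: Dict[str, Callable[..., Any]]):
        self.rules = rules
        self.tools = tools

    def execute(self, call: ToolCall) -> Any:
        if call.tool not in self.tools:
            raise PolicyViolation(f"unknown tool: {call.tool}")
        violations = [msg for rule in self.rules if (msg := rule(call)) is not None]
        if violations:
            raise PolicyViolation("; ".join(violations))
        return self.tools[call.tool](**call.args)


# Usage with a stand-in tool that only records what it was asked to do.
issued: List[float] = []


def issue_refund(amount: float, approved_by: str) -> str:
    issued.append(amount)
    return "ok"


gate = PolicyGate(
    rules=[refund_limit(100.0), require_approver("issue_refund")],
    tools={"issue_refund": issue_refund},
)
outcome = gate.execute(ToolCall("issue_refund", {"amount": 50.0, "approved_by": "jo"}))
```

The design choice here is that the rules live outside the model's context, so a violation is blocked even if the agent has lost track of the relevant handbook clause or asserts that it complied.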
Topics
- AI Agents
- Policy Adherence
- Enterprise Automation
- Benchmark Testing
- Large Language Models
- MLOps
Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Scientist, AI Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Surge AI Blog.