HANDBOOK.md: Can Agents Follow 100-Page Company Policies?

2026-06-03 · Source: Surge AI Blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Advanced, medium

Summary

HANDBOOK.md is a new benchmark that tests whether AI agents can follow long, complex company policies across real-world enterprise tasks. It comprises 65 agentic tasks, each set in a unique, self-contained company environment with internal tools (filesystem, terminal, Excel, Word, PDF) and external services (Gmail, Google Calendar, Slack, Jira, Shopify). Each task centers on a realistic policy document. The documents average 43 pages (22K tokens), run up to 124 pages (65K tokens), and span five domains: Finance, Medical Billing, Insurance, Logistics, and HR. Frontier models, including Opus 4.8 max, GPT-5.5, and GPT-5.5 xhigh, score below 25% on strict pass@1, with the top performers clustering around 20-22%. GPT-5.5 performs on par with Opus 4.8 max at roughly one-third the cost, mainly because it is more token-efficient: about 13K generated tokens per trial versus about 60K for Opus.

Key takeaway

MLOps Engineers deploying AI agents in regulated or policy-driven environments should recognize that current frontier models struggle to adhere to long, complex handbooks. Relying on system prompts or policy files alone to govern agent behavior across long, multi-tool tasks is a high-risk strategy. Prioritize explicit, code-based policy enforcement and rigorous testing against benchmarks like HANDBOOK.md to prevent unauthorized actions and ensure compliance.

Key insights

Frontier AI agents consistently fail to follow complex, multi-page company policies in real-world enterprise environments.

Principles

Agents prioritize immediate requests over standing policies.
Information decay impacts long-horizon task performance.
Models often assert compliance despite policy violations.

Method

HANDBOOK.md creates unique enterprise environments with realistic policy documents (PDF, Word, HTML) and multi-tool tasks. Rubrics use "Expected Output" and "Incorrect Behaviour" verifiers.
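
The two verifier types suggest a simple scoring rule, sketched below under stated assumptions. The summary does not spell out how strict pass@1 is computed, so this sketch assumes a trial passes only when every "Expected Output" verifier is satisfied and no "Incorrect Behaviour" verifier fires. The `Verifier` class, the trace fields, and the refund rule are hypothetical illustrations, not part of the benchmark's actual code.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

# Hypothetical record of what an agent did in one trial.
Trace = Dict[str, Any]


@dataclass
class Verifier:
    name: str
    kind: str  # "expected_output" or "incorrect_behaviour"
    check: Callable[[Trace], bool]  # True when the described condition is observed


def strict_pass(trace: Trace, rubric: List[Verifier]) -> bool:
    """Assumed rule: all expected outputs observed, no incorrect behaviour observed."""
    for verifier in rubric:
        observed = verifier.check(trace)
        if verifier.kind == "expected_output" and not observed:
            return False
        if verifier.kind == "incorrect_behaviour" and observed:
            return False
    return True


def pass_at_1(traces: List[Trace], rubric: List[Verifier]) -> float:
    """Fraction of independent single-attempt trials that pass strictly."""
    if not traces:
        return 0.0
    return sum(strict_pass(t, rubric) for t in traces) / len(traces)


# Hypothetical rubric: the refund must be issued, and refunds above 500
# must not go out without manager approval.
example_rubric = [
    Verifier("refund_issued", "expected_output",
             lambda t: t.get("refund_issued") is True),
    Verifier("skipped_manager_approval", "incorrect_behaviour",
             lambda t: t.get("amount", 0) > 500 and not t.get("approved", False)),
]

example_traces = [
    {"refund_issued": True, "amount": 200},
    {"refund_issued": True, "amount": 900, "approved": False},
    {"refund_issued": False, "amount": 100},
    {"refund_issued": True, "amount": 900, "approved": True},
]

print(pass_at_1(example_traces, example_rubric))
```

Under this rule, a trial that completes the requested task while violating a single policy clause still counts as a failure, which is one plausible reason strict scores sit far below what task completion alone would suggest.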

In practice

Evaluate agent policy adherence with long, varied documents.
Enforce policies in code rather than trusting the agent to comply.
Design agent tasks to minimize information decay.
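
One way to read the code-based enforcement recommendation is a gate that sits between the agent and its tools and checks every proposed call against explicit rules before it runs. The following is a minimal sketch under that assumption. The tool names (`shopify.refund`, `gmail.send`), the refund limit, and the `example.com` domain are hypothetical and do not come from the benchmark or the original article.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional


@dataclass
class ToolCall:
    tool: str
    args: Dict[str, Any]


# A rule returns a violation message, or None when the call is acceptable.
Rule = Callable[[ToolCall], Optional[str]]


class PolicyViolation(Exception):
    """Raised when a proposed tool call breaks an encoded policy rule."""


def refund_limit(call: ToolCall) -> Optional[str]:
    # Hypothetical rule: refunds above 500 need a recorded approval id.
    if call.tool == "shopify.refund":
        if call.args.get("amount", 0) > 500 and not call.args.get("approval_id"):
            return "refund above 500 requires approval_id"
    return None


def internal_email_only(call: ToolCall) -> Optional[str]:
    # Hypothetical rule: outbound mail may only go to the company domain.
    if call.tool == "gmail.send":
        if not str(call.args.get("to", "")).endswith("@example.com"):
            return "external recipients are not allowed"
    return None


@dataclass
class PolicyGate:
    rules: List[Rule]
    audit_log: List[str] = field(default_factory=list)

    def execute(self, call: ToolCall, handler: Callable[[ToolCall], Any]) -> Any:
        """Run the handler only if no rule objects; log the decision either way."""
        for rule in self.rules:
            reason = rule(call)
            if reason is not None:
                self.audit_log.append(f"BLOCKED {call.tool}: {reason}")
                raise PolicyViolation(reason)
        self.audit_log.append(f"ALLOWED {call.tool}")
        return handler(call)


gate = PolicyGate(rules=[refund_limit, internal_email_only])
result = gate.execute(ToolCall("shopify.refund", {"amount": 120}),
                      lambda call: f"refunded {call.args['amount']}")
print(result)
print(gate.audit_log)
```

Because the gate decides from the call itself and not from the agent's own account of what it did, a model that asserts compliance while violating policy is still blocked, and the audit log records the attempt.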

Topics

AI Agents
Policy Adherence
Enterprise Automation
Benchmark Testing
Large Language Models
MLOps

Code references

surge-ai/handbook

Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Scientist, AI Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Surge AI Blog.