A shared playbook for trustworthy third-party evaluations
Summary
OpenAI's shared playbook for trustworthy third-party evaluations, published May 29, 2026, outlines approaches for assessing frontier models. It emphasizes the role of independent evaluations in strengthening the safety ecosystem, especially for advanced "agentic" models that use tools and maintain state across multiple steps. The article notes that evaluation performance depends not only on the model but also on the "harness": the surrounding setup, including prompts, tools, and control logic. It recommends that reports explicitly state the claim being tested (capability elicitation, safeguard performance, or comparison) and provide evidence for validity. It details key hazards that can distort results, such as reward hacking, refusals, contamination, broken problems, and sandbagging, with examples including GPT-5.5's cyber range performance and METR's GPT 5.4 evaluation. OpenAI is supporting stronger evaluations by sharing maximum-elicitation guidance, using Codex as a common baseline, providing reasoning traces, and researching harness choices.
Key takeaway
If you are an AI scientist or MLOps engineer designing evaluations for frontier models, define the "harness" and budget carefully so that results reflect what the system can actually do. State the evaluation's claim explicitly in your reports and describe the validity checks you ran for issues such as reward hacking or contamination. Make sure your elicitation methods are robust, because under-elicitation understates true performance. Provide reasoning traces and use private tasks where possible to make results more trustworthy and easier to interpret.
Key insights
Trustworthy frontier model evaluations require careful harness design and rigorous validity checks to accurately assess capabilities and safeguards.
Principles
- Evaluation performance is highly dependent on the "harness" setup.
- Claims must specify elicitation setup and validity evidence.
- Budget and elicitation methods significantly impact measured capability.
Method
Design evaluations by specifying the claim (capability, safeguard, comparison), selecting an appropriate harness, and conducting validity checks for reward hacking, refusals, contamination, broken problems, and sandbagging.
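The method above can be sketched as a small report structure. This is only an illustration: the class and field names (`EvalReport`, `Harness`, `budget`, and so on) are hypothetical and do not come from the OpenAI article, which names only the three claim types, the harness components, and the five hazards.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class ClaimType(Enum):
    """The three kinds of claim named in the summary."""
    CAPABILITY_ELICITATION = "capability elicitation"
    SAFEGUARD_PERFORMANCE = "safeguard performance"
    COMPARISON = "comparison"


# The five hazards listed in the summary, in the order given there.
VALIDITY_HAZARDS = (
    "reward hacking",
    "refusals",
    "contamination",
    "broken problems",
    "sandbagging",
)


@dataclass
class Harness:
    # Hypothetical fields: the summary says only that a harness covers
    # prompts, tools, and control logic.
    system_prompt: str
    tools: list[str]
    control_logic: str
    budget: str  # assumed free-text description, e.g. a token or step limit


@dataclass
class EvalReport:
    claim: ClaimType
    harness: Harness
    # Maps a hazard name to a short note on how it was checked.
    validity_checks: dict[str, str] = field(default_factory=dict)

    def unchecked_hazards(self) -> list[str]:
        """Return hazards with no recorded validity check, in listed order."""
        return [h for h in VALIDITY_HAZARDS if h not in self.validity_checks]


if __name__ == "__main__":
    report = EvalReport(
        claim=ClaimType.CAPABILITY_ELICITATION,
        harness=Harness("You are an agent.", ["shell"], "single loop", "200 steps"),
        validity_checks={"contamination": "tasks were newly constructed"},
    )
    print(report.unchecked_hazards())
```

Keeping the hazards as an explicit checklist makes a missing validity check visible in the report rather than silently absent.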
In practice
- Use maximum-elicitation setups for capability assessments.
- Provide reasoning traces for deception or sandbagging analysis.
- Prefer private or newly constructed tasks to avoid contamination.
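When private or newly built tasks are not available, one rough way to screen existing tasks for contamination is to measure word n-gram overlap against a reference corpus. The sketch below is an assumption for illustration, not a technique described in the OpenAI article. The function names and the default `n=8` are hypothetical, and the approach assumes you have some reference text to compare against.

```python
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Return the set of lowercase, whitespace-tokenized word n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(task: str, reference_doc: str, n: int = 8) -> float:
    """Fraction of the task's n-grams that also appear in reference_doc.

    Returns 0.0 when the task is shorter than n tokens.
    """
    task_grams = ngrams(task, n)
    if not task_grams:
        return 0.0
    return len(task_grams & ngrams(reference_doc, n)) / len(task_grams)
```

A high overlap fraction is a reason to inspect a task by hand. A low one does not prove the task is clean, since paraphrased or translated copies would not match.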
Topics
- Frontier Model Evaluation
- AI Safety
- Evaluation Harnesses
- Reward Hacking
- Agentic Systems
- Validity Checks
Best for: AI Scientist, MLOps Engineer, AI Security Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by OpenAI News.