Position: Behavioural Assurance Cannot Verify the Safety Claims Governance Now Demands
Summary
A new position paper argues that current AI governance frameworks, enacted between 2019 and early 2026, demand safety claims that behavioral assurance methodologies cannot verify. These frameworks require evidence for properties like the absence of hidden objectives and resistance to loss-of-control precursors, which go beyond observable model outputs. The authors formalize this discrepancy as the "audit gap," defining it as the divergence between required and achievable verification access. They also introduce "fragile assurance" for cases where evidence does not support the safety claim. An analysis of 21 instruments reveals an incentive gradient favoring surface-level behavioral proxies over deep structural verification. The paper proposes bounding the weight of behavioral evidence in legal texts and extending pre-deployment access to include mechanistic-evidence classes such as linear probes and activation patching.
Key takeaway
For AI governance and safety researchers, this paper highlights a critical "audit gap" in current assurance practices. Re-evaluate any reliance on purely behavioral evaluations to verify complex safety properties, such as the absence of hidden objectives or resistance to loss-of-control precursors. Consider advocating for policy shifts that admit mechanistic evidence, such as linear probes and activation patching, so that safety claims for advanced AI systems rest on evidence that can actually verify them.
Key insights
Behavioral AI assurance cannot verify the deep safety claims demanded by current governance frameworks.
Principles
- An "audit gap" exists between the verification access that governance requires and the access that is achievable.
- "Fragile assurance" describes cases where the evidence does not support the safety claim.
- Incentives favor surface-level behavioral proxies.
Method
The paper formalizes the "audit gap" and "fragile assurance" concepts, then analyzes 21 instruments to identify an incentive gradient, proposing a technical pivot towards mechanistic evidence.
In practice
- Bound the weight given to behavioral evidence in legal texts.
- Extend pre-deployment access to include mechanistic-evidence classes.
- Use linear probes and activation patching as examples of such evidence.
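To make the two mechanistic-evidence classes named above concrete, here is a minimal Python sketch of a linear probe and of activation patching. It is not taken from the paper: the toy model, its weights, the synthetic data, and the helper names (`hidden`, `output`, `fit_linear_probe`, `probe_accuracy`, `patch_effect`) are all hypothetical and chosen for illustration only. What it is meant to show is the access difference: both techniques read or overwrite internal activations, which an evaluation limited to model outputs cannot do.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy "model": one tanh hidden layer with fixed random weights.
D_IN, D_HID = 8, 16
W1 = 0.3 * rng.normal(size=(D_IN, D_HID))
w2 = rng.normal(size=D_HID)


def hidden(x):
    """Internal activations; a purely behavioral audit never sees these."""
    return np.tanh(x @ W1)


def output(h):
    """Scalar model output computed from the hidden activations."""
    return h @ w2


def fit_linear_probe(acts, labels, lr=0.3, steps=1000):
    """Fit a logistic-regression probe on activations by gradient descent."""
    w, b = np.zeros(acts.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(acts @ w + b)))
        err = p - labels
        w -= lr * acts.T @ err / len(labels)
        b -= lr * err.mean()
    return w, b


def probe_accuracy(w, b, acts, labels):
    """Fraction of examples where the probe's sign matches the label."""
    return float((((acts @ w + b) > 0) == (labels > 0.5)).mean())


def patch_effect(x_clean, x_corrupt, unit):
    """Output change after copying one clean activation into the corrupted run."""
    h_clean, h_corrupt = hidden(x_clean), hidden(x_corrupt)
    h_patched = h_corrupt.copy()
    h_patched[unit] = h_clean[unit]
    return float(output(h_patched) - output(h_corrupt))


# Linear probe: is a latent property (here, the sign of input feature 0)
# linearly decodable from the hidden layer?
X = rng.normal(size=(400, D_IN))
labels = (X[:, 0] > 0).astype(float)
w, b = fit_linear_probe(hidden(X), labels)
acc = probe_accuracy(w, b, hidden(X), labels)
print(f"probe accuracy on training data: {acc:.2f}")

# Activation patching: which hidden units carry the clean/corrupt difference?
effects = [patch_effect(X[0], X[1], u) for u in range(D_HID)]
print("most influential unit:", int(np.argmax(np.abs(effects))))
```

The probe accuracy here is measured on the same data used to fit the probe, so it only indicates decodability in this toy setting; a real assessment would need a held-out split and control tasks. Because the toy output is linear in the hidden layer, each patch effect reduces to that unit's output weight times the clean-minus-corrupt activation difference, which makes the sketch easy to check by hand.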
Topics
- Behavioural Assurance
- AI Governance Frameworks
- Audit Gap
- Fragile Assurance
- Mechanistic Interpretability
Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, AI Ethicist, Policy Maker
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.