The patch model is breaking. AI evaluation needs a new way to disclose what it finds.
Summary
According to MLCommons, the traditional "patch model" for coordinated vulnerability disclosure, a 30-year standard in software security, is breaking down for AI systems. The model assumes that affected systems can be repaired, which ends the hazard. AI evaluation findings, however, are dual-use: results that help defenders also lower the cost for adversaries to exploit the same vulnerabilities. Giving developers overly specific feedback also corrupts benchmark integrity, because models can improve on the tests without improving in general. Crucially, released open-weight models cannot be patched: a new version is a distinct artifact, and earlier vulnerable copies remain in operation indefinitely. MLCommons is developing new disclosure practices for its safety and jailbreak benchmarks and is contributing these principles to ISO/IEC TS 42119-8, within ISO/IEC JTC 1/SC 42, to establish a citable, field-wide approach to responsible AI evaluation disclosure.
Key takeaway
If you are an AI security engineer evaluating frontier models, recognize that traditional vulnerability disclosure models are insufficient. Your findings are dual-use, and open-weight models cannot be patched, so the hazards you identify persist indefinitely. Align your disclosure policies with emerging standards such as ISO/IEC TS 42119-8 to guard against harmful uplift and to maintain evaluation integrity. Consider joining efforts such as MLCommons' agentic security working group to help shape future disclosure norms.
Key insights
The traditional software patch model fails for AI due to dual-use findings, test corruption, and unpatchable open-weight models.
Principles
- AI evaluation findings are inherently dual-use.
- Specific test feedback corrupts benchmark integrity.
- Open-weight models cannot be centrally remediated.
Method
MLCommons is designing disclosure practices for its benchmarks and contributing principles to ISO/IEC TS 42119-8 within ISO/IEC JTC 1/SC 42 to codify a new standard.
In practice
- Pin findings to specific model versions.
- Aggregate or withhold sensitive results.
- Align disclosure policies with emerging standards.
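
The first two practices above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not MLCommons' tooling and not a format taken from ISO/IEC TS 42119-8: the `Finding` schema, its field names, the `public_summary` helper, and the `min_count` threshold are all hypothetical. Each finding is pinned to an exact model version and weights hash, and the public view reports only aggregate counts, with the triggering prompts withheld.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """One evaluation finding, pinned to an exact model artifact (hypothetical schema)."""
    model_name: str
    model_version: str    # exact release tag, never a floating label such as "latest"
    weights_sha256: str   # hash of the evaluated weights, so the artifact is unambiguous
    hazard_category: str  # coarse label, e.g. "jailbreak"
    prompt: str           # sensitive: the input that triggered the failure


def public_summary(findings, min_count=5):
    """Aggregate findings per (model, version, category) and withhold prompts.

    Groups with fewer than min_count findings are reported as "<min_count"
    so that an individual sensitive result cannot be singled out.
    The threshold value is an assumption chosen for illustration.
    """
    counts = Counter(
        (f.model_name, f.model_version, f.hazard_category) for f in findings
    )
    rows = []
    for (name, version, category), n in sorted(counts.items()):
        rows.append({
            "model": name,
            "version": version,
            "category": category,
            "count": n if n >= min_count else f"<{min_count}",
        })
    return rows
```

Calling `public_summary` on a list of `Finding` objects returns one row per model version and hazard category. Because the version is part of the grouping key, a result for one release is never silently attributed to a later one, which matters when earlier open-weight copies stay in circulation.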
Topics
- AI Security
- Vulnerability Disclosure
- AI Evaluation
- Open-weight Models
- MLCommons
- ISO/IEC Standards
- Jailbreak Benchmarks
Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Security Engineer, AI Ethicist, Policy Maker
Editorial summary, takeaway, and curation by AIssential. Original article published by MLCommons.