Red-Teaming Anthropic's Internal Agent Monitoring Systems
Summary
METR staff member David Rein conducted a three-week red-teaming exercise in collaboration with Anthropic, targeting a subset of Anthropic's internal agent monitoring and security systems. These systems are described in the Opus 4.6 Sabotage Risk Report (Appendix 8.4, specifically 8.4.8). The exercise uncovered several novel vulnerabilities, some of which have since been patched; none of them severely undermines the major claims in the Opus 4.6 report. It also produced artifacts, including agent trajectories containing covert attacks and a small test set for attack strategy ideation, which are expected to aid ongoing improvements. A 26-page report was shared internally. Adversarial testing of this kind by external researchers is described as valuable both for discovering vulnerabilities and for establishing best practices for embedding third-party evaluators within frontier AI companies. Further details are available in the Frontier Risk Report (February-March 2026), Appendix B.
Key takeaway
If you build or secure agent monitoring systems at a frontier AI developer, consider making external red-teaming part of your security process. In METR's three-week exercise with Anthropic, an external researcher uncovered novel vulnerabilities and produced attack artifacts expected to aid ongoing improvements. Engaging third-party evaluators can strengthen your security posture, help establish best practices for collaborative vulnerability discovery, and make monitoring systems more robust against covert attacks.
Key insights
External red-teaming can identify novel vulnerabilities in frontier AI agent monitoring systems, as METR's exercise with Anthropic did.
Principles
- Adversarial testing reveals system vulnerabilities.
- Third-party evaluators enhance AI security practices.
- Collaboration improves monitoring system robustness.
Method
The exercise was a three-week engagement in which the red-teamer had substantial access to internal systems, staff, and information. It culminated in a detailed report documenting the discovered vulnerabilities and attack artifacts.
In practice
- Conduct external red-teaming for AI systems.
- Share detailed reports with developers.
- Use attack artifacts for system improvement.
Topics
- Red Teaming
- AI Security
- Agent Monitoring Systems
- Vulnerability Discovery
- Frontier AI
- Third-Party Evaluation
Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Security Engineer, AI Scientist, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by METR.