Red-Teaming Anthropic's Internal Agent Monitoring Systems

2026-03-26 · Source: METR · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Expert, quick

Summary

METR staff member David Rein conducted a three-week red-teaming exercise in collaboration with Anthropic, targeting a subset of Anthropic's internal agent monitoring and security systems. These systems are described in the Opus 4.6 Sabotage Risk Report (Appendix 8.4, specifically 8.4.8). The exercise uncovered several novel vulnerabilities, some of which have since been patched; none severely undermines the major claims of the Opus 4.6 report. It also produced artifacts, including agent trajectories containing covert attacks and a small test set for attack strategy ideation, which are expected to aid ongoing improvements. A 26-page report was shared internally. Adversarial testing of this kind by external researchers is considered valuable both for discovering vulnerabilities and for establishing best practices for embedding third-party evaluators within frontier AI companies. Further details are available in the Frontier Risk Report (February-March 2026), Appendix B.

Key takeaway

For frontier AI developers building or securing agent monitoring systems, external red-teaming is worth integrating. METR's three-week exercise with Anthropic uncovered novel vulnerabilities and produced attack artifacts that can feed further improvements. Seek out third-party evaluators to strengthen your security posture and to establish best practices for collaborative vulnerability discovery, particularly against covert attacks.

Key insights

External red-teaming can surface novel vulnerabilities in frontier AI agent monitoring systems, as this three-week exercise did.

Principles

Adversarial testing reveals system vulnerabilities.
Third-party evaluators enhance AI security practices.
Collaboration improves monitoring system robustness.

Method

The exercise was a three-week engagement in which the red-teamer had substantial access to internal systems, staff, and information. It culminated in a detailed report of the discovered vulnerabilities and attack artifacts.

In practice

Conduct external red-teaming for AI systems.
Share detailed reports with developers.
Use attack artifacts for system improvement.

Topics

Red Teaming
AI Security
Agent Monitoring Systems
Vulnerability Discovery
Frontier AI
Third-Party Evaluation

Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Security Engineer, AI Scientist, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by METR.