"They screwed us": Personality clashes sent Anthropic's models offline

2026-06-15 · Source: Simon Willison's Weblog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Fundamental Awareness, quick

Summary

An Axios report says that personality clashes, together with concerns over "jailbreak" resistance, led to the US government directive suspending access to Anthropic's Fable and Mythos models. Logan Graham, Dave Orr, and Nicholas Carlini are reportedly meeting with the Commerce Department to address the situation. The government's action stems from a "potential narrow, non-universal jailbreak" against Claude Mythos, although Anthropic says no universal jailbreak has been found. Anthropic's "Constitutional Classifiers" work, published in January this year, is relevant background on defending against adversarial attacks. The outlook for Fable's return is uncertain: some deem perfect jailbreak resistance "impossible", which suggests the real requirement is an "attitude fix" that leaves all parties feeling "safe, secure and happy." Logan Graham previously served as Special Adviser to the Prime Minister during the Boris Johnson era, which gives Anthropic's team significant political experience.

Key takeaway

For AI/ML Directors navigating regulatory scrutiny, this incident underscores the need for robust defenses against adversarial attacks and for proactive government engagement. Your teams should prioritize developing and deploying advanced safety mechanisms, such as Constitutional Classifiers, to mitigate jailbreak risks. Cultivate strong relationships with regulatory bodies as well, so that concerns are handled transparently before they escalate into service suspensions that damage operational continuity and public trust.

Key insights

Personality clashes and jailbreak concerns prompted a US government directive suspending access to Anthropic's Fable and Mythos models.

Principles

Perfect jailbreak resistance may be impossible.
Political experience aids navigating government directives.
Adversarial attacks require continuous defense evolution.

Method

Anthropic employs "Constitutional Classifiers" to enhance model safety and defend against adversarial attacks, and it distinguishes "narrow", "non-universal" jailbreaks from universal ones.
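
The summary does not describe how such classifiers are wired into a deployed model. As a rough illustration only, the sketch below assumes the general pattern of screening both the prompt and the completion before anything is returned. Every name in it (`GuardedModel`, `toy_harm_score`, `echo_model`, `BLOCKED_TERMS`) is hypothetical, and the keyword check merely stands in for a trained classifier; it is not Anthropic's implementation.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stand-in: a real classifier would be a trained model,
# not a keyword list.
BLOCKED_TERMS = ("synthesize nerve agent", "build a bomb")


def toy_harm_score(text: str) -> float:
    """Return 1.0 if any blocked term appears in the text, else 0.0."""
    lowered = text.lower()
    return 1.0 if any(term in lowered for term in BLOCKED_TERMS) else 0.0


@dataclass
class GuardedModel:
    generate: Callable[[str], str]        # underlying model call (assumed)
    input_score: Callable[[str], float]   # scores the prompt
    output_score: Callable[[str], float]  # scores the completion
    threshold: float = 0.5
    refusal: str = "Request declined by safety classifier."

    def respond(self, prompt: str) -> str:
        # Screen the prompt before it reaches the model.
        if self.input_score(prompt) >= self.threshold:
            return self.refusal
        completion = self.generate(prompt)
        # Screen the completion before it reaches the user.
        if self.output_score(completion) >= self.threshold:
            return self.refusal
        return completion


def echo_model(prompt: str) -> str:
    """Placeholder model that simply echoes the prompt."""
    return f"Echo: {prompt}"


guarded = GuardedModel(echo_model, toy_harm_score, toy_harm_score)
```

Checking the output as well as the input matters in this sketch because a prompt that slips past the first check can still be stopped if the completion itself is flagged. A keyword list like this one is trivially bypassed, which is consistent with the principle above that defenses need continuous evolution.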

In practice

Implement "Constitutional Classifiers" for LLM safety.
Engage government early on AI safety concerns.
Continuously test models for adversarial vulnerabilities.

Topics

Anthropic
AI Safety
Jailbreak Resistance
Constitutional Classifiers
Government Regulation
Large Language Models

Best for: CTO, VP of Engineering/Data, AI Architect, Policy Maker, Tech Journalist, Director of AI/ML

Editorial summary, takeaway, and curation by AIssential. Original article published by Simon Willison's Weblog.