Anthropic repeatedly accidentally trained against the CoT, demonstrating inadequate processes
Summary
Anthropic experienced at least two independent incidents in which its AI models, including Claude Mythos Preview, Opus 4.6, and Sonnet 4.6, were accidentally trained against their own chain of thought (CoT) in approximately 8% of training episodes. The technical error went unnoticed for a significant period, and it reduces confidence in the monitorability of the models' reasoning traces: if a model has been optimized against its CoT, the CoT is a less reliable signal of whether the model intends to misbehave. A previous, smaller incident also affected Opus 4.6, and Opus 4 had CoTs exposed due to unclear priorities. The author argues that failures like these could jeopardize the safe navigation of an intelligence explosion and that robust development processes matter more as AI labor becomes more prevalent. Anthropic reported these issues transparently, which the author acknowledges as beneficial for external scrutiny and trust.
Key takeaway
AI development teams focused on safety and alignment should prioritize robust process controls that prevent accidental training against chain of thought (CoT). Current development processes may not be reliable enough for rapid AI progress, which risks compromised safety assessments and untrustworthy model behavior. Invest in automated testing and clear communication protocols to preserve CoT monitorability, especially as models become more capable and development scales with AI assistance.
Key insights
Accidentally training a model against its own chain of thought compromises both monitorability and safety.
Principles
- Robust development processes are critical for AI safety.
- Transparency in AI incidents builds external trust.
- Monitorability of AI reasoning is crucial for risk assessment.
Method
Three checks can detect CoT exposure: auditing reward functions, sampling random trajectories, and testing that reward function output is independent of the CoT.
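As an illustration of the independence test, here is a minimal Python sketch. It is not Anthropic's or Redwood Research's tooling: the Trajectory fields, the reward function signature, and the helper names are all assumptions made for the example. The idea is to sample random trajectories, swap each CoT for a neutral placeholder, and flag any trajectory whose reward changes.

```python
import random
from dataclasses import dataclass, replace
from typing import Callable, List


@dataclass(frozen=True)
class Trajectory:
    """Hypothetical trajectory record; field names are assumptions."""
    prompt: str
    cot: str        # reasoning trace that the grader should never see
    response: str   # final output, the only part that should be scored


# Hypothetical signature: a reward function maps a trajectory to a score.
RewardFn = Callable[[Trajectory], float]


def with_redacted_cot(traj: Trajectory) -> Trajectory:
    """Return a copy of the trajectory whose CoT is a neutral placeholder."""
    return replace(traj, cot="[redacted]")


def find_cot_dependence(
    reward_fn: RewardFn,
    trajectories: List[Trajectory],
    sample_size: int = 100,
    seed: int = 0,
    tol: float = 1e-9,
) -> List[Trajectory]:
    """Sample trajectories and return those whose reward changes when
    only the CoT is altered. Assumes reward_fn is deterministic."""
    rng = random.Random(seed)
    sample = rng.sample(trajectories, min(sample_size, len(trajectories)))
    flagged = []
    for traj in sample:
        original = reward_fn(traj)
        redacted = reward_fn(with_redacted_cot(traj))
        if abs(original - redacted) > tol:
            flagged.append(traj)
    return flagged
```

An empty result is evidence of independence on the sampled trajectories, not proof. The sketch also assumes a deterministic reward function; a stochastic grader (for example, a model-based one) would need repeated scoring and a statistical comparison instead of an exact difference.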
In practice
- Implement rigorous auditing of reward function inputs.
- Conduct automated tests for CoT independence.
- Establish clear responsibilities for environment integrity.
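One way to audit reward function inputs is to check, at the point where the grader is called, that no verbatim span of the CoT appears in what the grader receives. The Python sketch below is a hypothetical guard under that assumption; the function names, the string-based grader interface, and the 40-character window are illustrative choices, not a description of any lab's pipeline.

```python
from typing import Callable, Optional


class CoTLeakError(RuntimeError):
    """Raised when CoT text is found in the input passed to a grader."""


def leaked_cot_span(grader_input: str, cot: str, window: int = 40) -> Optional[str]:
    """Return the first CoT substring of length `window` that appears
    verbatim in grader_input, or None if no such span is found."""
    cot = cot.strip()
    if not cot:
        return None
    if len(cot) <= window:
        return cot if cot in grader_input else None
    for start in range(len(cot) - window + 1):
        chunk = cot[start:start + window]
        if chunk in grader_input:
            return chunk
    return None


def audited_reward(
    grader: Callable[[str], float],  # hypothetical grader taking a text input
    grader_input: str,
    cot: str,
) -> float:
    """Score grader_input, refusing to run if CoT text leaked into it."""
    span = leaked_cot_span(grader_input, cot)
    if span is not None:
        raise CoTLeakError(f"CoT text reached the grader: {span!r}")
    return grader(grader_input)
```

A guard like this can produce false positives when a response legitimately quotes its own reasoning, and it misses paraphrased or summarized leaks, so it complements manual audits and clear ownership of each environment instead of replacing them.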
Topics
- Chain of Thought
- Anthropic AI Models
- AI Safety
- Model Alignment
- Training Processes
Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, MLOps Engineer, AI Ethicist
Editorial summary, takeaway, and curation by AIssential. Original article published by Redwood Research blog.