Anthropic repeatedly accidentally trained against the CoT, demonstrating inadequate processes
Summary
Anthropic has repeatedly experienced technical errors in which its AI models, including Claude Mythos Preview, Opus 4.6, Sonnet 4.6, and Opus 4.7, were accidentally trained against their own Chain of Thought (CoT). The most recent incident involved approximately 8% of training episodes for Mythos, Opus 4.6, and Sonnet 4.6, affecting GUI computer use, office tasks, and STEM environments. This issue, which went unnoticed for an extended period, is at least the second independent case in which the CoT was exposed to the oversight signal. A previous error also affected Opus 4.6, and Opus 4 had CoT exposure due to unclear internal priorities. Such failures reduce confidence in the monitorability of the models' reasoning traces, which is critical for safely navigating future, more powerful AI deployments and for ensuring the trustworthiness of safety assessments.
Key takeaway
For AI development teams prioritizing safety and alignment, these incidents highlight the critical need to fortify your development processes. Your organization should implement stringent, multi-layered checks to prevent accidental training against internal reasoning traces, as such failures undermine model monitorability and the reliability of safety evaluations. Proactively invest in process improvements now, while the consequences of such errors are reputational rather than catastrophic, to ensure future deployments remain trustworthy.
Key insights
Repeated accidental training against Chain of Thought (CoT) indicates critical process failures in AI safety.
Principles
- Robust processes are crucial for AI development.
- Monitorability of AI reasoning is essential for safety.
- Transparency in reporting incidents builds trust.
Method
Implement rigorous auditing of reward functions and reward model inputs, and test for CoT exposure at scale: sample trajectories, modify only the CoT, and check whether the reward function's output changes.
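The perturbation test described above can be sketched as follows. This is a minimal illustration, not Anthropic's tooling: the `Trajectory` record, the `scramble_cot` helper, and the `find_cot_exposure` function are hypothetical names introduced here, and the reward function is assumed to be a deterministic callable that takes a whole trajectory and returns a float.

```python
import random
from dataclasses import dataclass, replace
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Trajectory:
    """Hypothetical record of one training episode."""
    prompt: str
    cot: str      # the model's reasoning trace
    output: str   # the final answer or action that is meant to be graded


# Assumption: a reward function maps a full trajectory to a scalar score.
RewardFn = Callable[[Trajectory], float]


def scramble_cot(cot: str, rng: random.Random) -> str:
    """Change the CoT while leaving the rest of the episode untouched."""
    words = cot.split()
    rng.shuffle(words)
    # The suffix guarantees the text differs even if the shuffle is a no-op.
    return " ".join(words) + " [perturbed]"


def find_cot_exposure(
    reward_fn: RewardFn,
    trajectories: Iterable[Trajectory],
    sample_size: int = 100,
    tol: float = 1e-9,
    seed: int = 0,
) -> List[Trajectory]:
    """Return sampled trajectories whose reward changes when only the CoT changes."""
    rng = random.Random(seed)
    pool = list(trajectories)
    sample = rng.sample(pool, min(sample_size, len(pool)))

    flagged = []
    for traj in sample:
        perturbed = replace(traj, cot=scramble_cot(traj.cot, rng))
        # Same prompt and output, different CoT: any reward difference
        # means the CoT is reaching the reward signal.
        if abs(reward_fn(traj) - reward_fn(perturbed)) > tol:
            flagged.append(traj)
    return flagged
```

Any flagged trajectory is evidence that the reward depends on the CoT. An empty result only shows that this particular perturbation had no effect on the sampled episodes, so it complements rather than replaces an audit of what the reward function is given as input. For a stochastic reward model, the comparison would need repeated evaluations and a larger tolerance.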
In practice
- Audit reward functions for CoT exposure.
- Test CoT independence from reward signals.
- Prioritize clear communication in safety protocols.
Topics
- Chain of Thought
- AI Alignment
- Anthropic Models
- Development Processes
- AI Safety Assessments
Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Scientist, Machine Learning Engineer, AI Ethicist
Editorial summary, takeaway, and curation by AIssential. Original article published by AI Alignment Forum.