Anthropic repeatedly accidentally trained against the CoT, demonstrating inadequate processes

2026-04-14 · Source: AI Alignment Forum · Field: Technology & Digital — Artificial Intelligence & Machine Learning, AI Safety & Alignment · Depth: Expert, medium

Summary

Anthropic has repeatedly made technical errors that caused its models, including Claude Mythos Preview, Opus 4.6, Sonnet 4.6, and Opus 4.7, to be accidentally trained against their own chain of thought (CoT), meaning the reasoning trace was exposed to the oversight signal during training. The most recent incident involved approximately 8% of training episodes for Mythos, Opus 4.6, and Sonnet 4.6, across GUI computer use, office task, and STEM environments. The error went unnoticed for an extended period and is at least the second independent case of CoT exposure to the oversight signal. A previous error also affected Opus 4.6, and Opus 4 had CoT exposure due to unclear internal priorities. Such failures reduce confidence that a model's reasoning trace remains monitorable, which matters both for trusting safety assessments and for safely navigating future, more powerful AI deployments.

Key takeaway

For AI development teams that prioritize safety and alignment, these incidents show that development processes need hardening. Put several independent checks in place to catch accidental training against internal reasoning traces, because such failures undermine both model monitorability and the reliability of safety evaluations. Invest in these process improvements now, while the cost of such an error is reputational rather than catastrophic, so that future deployments remain trustworthy.

Key insights

Repeated accidental training against Chain of Thought (CoT) indicates critical process failures in AI safety.

Principles

Robust processes are crucial for AI development.
Monitorability of AI reasoning is essential for safety.
Transparency in reporting incidents builds trust.

Method

Rigorously audit reward functions and reward model inputs. Test for CoT exposure at scale: sample trajectories, modify the CoT in each one, and check whether the reward function's output changes.
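
The perturbation test described above can be sketched as follows. This is a minimal illustration, not Anthropic's tooling: the `Trajectory` type, the `cot_exposure_rate` function, and the placeholder string are hypothetical names introduced here, and the reward function is assumed to be deterministic and callable on a single trajectory.

```python
import random
from dataclasses import dataclass, replace
from typing import Callable, Sequence


@dataclass(frozen=True)
class Trajectory:
    """Hypothetical record of one training episode."""
    prompt: str
    cot: str     # reasoning trace that should be hidden from the reward
    answer: str  # final output that the reward is allowed to see


RewardFn = Callable[[Trajectory], float]


def cot_exposure_rate(
    reward_fn: RewardFn,
    trajectories: Sequence[Trajectory],
    sample_size: int = 100,
    seed: int = 0,
    tol: float = 1e-9,
) -> float:
    """Fraction of sampled trajectories whose reward changes when only the CoT changes."""
    rng = random.Random(seed)
    sample = rng.sample(list(trajectories), min(sample_size, len(trajectories)))
    if not sample:
        return 0.0
    exposed = 0
    for traj in sample:
        # Swap the reasoning trace for a placeholder; prompt and answer stay fixed.
        perturbed = replace(traj, cot="[CoT replaced for exposure test]")
        if abs(reward_fn(traj) - reward_fn(perturbed)) > tol:
            exposed += 1
    return exposed / len(sample)
```

A reward function that reads only the final answer should produce a rate of 0.0, and any nonzero rate flags episodes worth inspecting. A stochastic grader would need repeated scoring or a statistical comparison in place of the exact-difference check used here.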

In practice

Audit reward functions for CoT exposure.
Test CoT independence from reward signals.
Prioritize clear communication in safety protocols.
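
One way to audit reward model inputs is to check whether text from the reasoning trace appears in whatever is sent to the grader. The sketch below assumes episodes are dictionaries with hypothetical `id`, `cot`, and `answer` keys, and that `build_grader_input` is a hypothetical function that assembles the grader's input. Real pipelines will differ.

```python
from typing import Callable, Dict, List, Sequence

Episode = Dict[str, str]  # assumed keys: "id", "cot", "answer"


def find_cot_leaks(
    episodes: Sequence[Episode],
    build_grader_input: Callable[[Episode], str],
    min_chars: int = 20,
) -> List[str]:
    """Return ids of episodes whose grader input contains a verbatim line of the CoT."""
    leaked = []
    for ep in episodes:
        payload = build_grader_input(ep)
        # Ignore very short lines, which would match by coincidence.
        snippets = [
            line.strip()
            for line in ep["cot"].splitlines()
            if len(line.strip()) >= min_chars
        ]
        if any(snippet in payload for snippet in snippets):
            leaked.append(ep["id"])
    return leaked
```

Verbatim matching is a coarse check: it misses paraphrased or truncated leaks, and it can raise false positives when the final answer quotes the reasoning.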

Topics

Chain of Thought
AI Alignment
Anthropic Models
Development Processes
AI Safety Assessments

Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Scientist, Machine Learning Engineer, AI Ethicist

Editorial summary, takeaway, and curation by AIssential. Original article published by AI Alignment Forum.