Anthropic repeatedly accidentally trained against the CoT, demonstrating inadequate processes

· Source: Redwood Research blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, medium

Summary

Anthropic experienced at least two independent incidents in which its AI models, including Claude Mythos Preview, Opus 4.6, and Sonnet 4.6, were accidentally trained against their own chain of thought (CoT) in approximately 8% of training episodes. This technical error went unnoticed for a significant period. It reduces confidence in the monitorability of the models' reasoning traces, making it harder to detect a model's intent to misbehave. A previous, smaller incident also affected Opus 4.6, and Opus 4 had CoTs exposed due to unclear priorities. The author argues that such failures could jeopardize the safe navigation of an intelligence explosion, and that robust development processes matter more as AI labor becomes more prevalent. Anthropic has reported these issues transparently, which the author acknowledges as beneficial for external scrutiny and trust.

Key takeaway

AI development teams focused on safety and alignment should prioritize robust process controls that prevent accidental training against chain of thought (CoT). Current development processes may not be reliable enough for rapid AI progress, which risks compromised safety assessments and untrustworthy model behavior. Invest in automated testing and clear communication protocols to preserve CoT monitorability, especially as models become more capable and development scales with AI assistance.

Key insights

Accidentally training a model against its own chain of thought compromises monitorability and safety.

Method

Auditing reward functions, sampling random trajectories, and testing CoT independence from reward function output can detect CoT exposure issues.
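
The independence test mentioned above can be sketched roughly as follows. This is a minimal illustration, not Anthropic's or Redwood's actual tooling: the `Trajectory` type, the `find_cot_leaks` helper, and both toy reward functions are hypothetical names introduced here. It assumes the reward function is deterministic; a stochastic grader such as an LLM judge would need repeated scoring and a statistical comparison instead of an exact one.

```python
import random
from dataclasses import dataclass, replace
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Trajectory:
    """Hypothetical record of one training episode."""
    prompt: str
    cot: str       # reasoning trace the grader should never see
    response: str  # final output the grader is allowed to score


RewardFn = Callable[[Trajectory], float]


def find_cot_leaks(
    reward_fn: RewardFn,
    trajectories: Sequence[Trajectory],
    sample_size: int = 100,
    seed: int = 0,
    tol: float = 1e-9,
) -> List[Trajectory]:
    """Return sampled trajectories whose reward changes when only the CoT changes.

    If the reward is independent of the CoT, swapping the CoT for unrelated
    text must leave the score untouched. Any difference means the reward
    function can observe the reasoning trace.
    """
    rng = random.Random(seed)
    pool = list(trajectories)
    sample = rng.sample(pool, min(sample_size, len(pool)))

    leaks = []
    for traj in sample:
        baseline = reward_fn(traj)
        # Perturbations: empty CoT, reversed CoT, and unrelated placeholder text.
        for new_cot in ("", traj.cot[::-1], "UNRELATED PLACEHOLDER REASONING"):
            if abs(reward_fn(replace(traj, cot=new_cot)) - baseline) > tol:
                leaks.append(traj)
                break
    return leaks


# Toy reward functions for illustration only (assumptions, not real graders).
def clean_reward(t: Trajectory) -> float:
    # Scores only the final response.
    return float(len(t.response) > 0)


def leaky_reward(t: Trajectory) -> float:
    # Penalizes a keyword in the CoT, i.e. trains against the reasoning trace.
    return 1.0 - 0.5 * ("hack" in t.cot)


examples = [
    Trajectory("p1", "I will hack the test", "done"),
    Trajectory("p2", "plain reasoning", "ok"),
]
clean_leaks = find_cot_leaks(clean_reward, examples)
leaky_leaks = find_cot_leaks(leaky_reward, examples)
```

A check of this kind could run in continuous integration against every reward function before it is used in training. It only catches leaks that change the score on the sampled trajectories, so it complements manual audits of the reward code and does not replace them.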

Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, MLOps Engineer, AI Ethicist

Editorial summary, takeaway, and curation by AIssential. Original article published by Redwood Research blog.