A review of “Investigating the consequences of accidentally grading CoT during RL”

Source: Redwood Research blog · Field: Technology & Digital, Artificial Intelligence & Machine Learning, AI Safety & Alignment · Depth: Expert, long

Summary

OpenAI recently disclosed that some of its reinforcement learning (RL) training for models such as GPT-5.4 Thinking accidentally exposed chains of thought (CoT) to graders, an issue similar to one Anthropic encountered with models such as Mythos and Opus 4.6. For GPT-5.4 Thinking, the exposure occurred in 1.5% of trajectories; Anthropic's models showed CoT exposure in 8% of RL episodes. OpenAI built an automated detection system to prevent recurrences and investigated whether the accidental CoT grading reduced model monitorability. Its analysis, reviewed by Redwood Research, suggests the training is unlikely to have substantially degraded monitorability, assuaging 80% of initial concerns. However, the review stresses that the evidence does not fully rule out subtler, harder-to-detect degradation, such as the suppression of misaligned goals in CoTs, which could affect future models.

Key takeaway

For CTOs and VPs of Engineering overseeing frontier AI development, this incident underscores the critical need for rigorous internal controls and external validation of safety claims. Your teams should prioritize implementing automated systems to detect and prevent accidental CoT exposure or similar training anomalies. Relying solely on internal assessments may leave subtle, long-term risks unaddressed, potentially impacting future model alignment and monitorability. Consider engaging third-party organizations for independent reviews of your safety evidence to bolster trust and identify blind spots.

Key insights

Accidental CoT exposure during RL training may subtly degrade model monitorability, even if direct impacts are hard to measure.

Method

OpenAI used an automated system to detect CoT exposure in past training runs. It then assessed monitorability by analyzing the effects of grader reward and a dedicated monitorability score, which included rerunning training for one model.
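
For illustration only: the summary does not describe how OpenAI's detector works, so the following is a minimal sketch of one possible check, not their method. It assumes each RL episode is logged with both the CoT text and the exact text shown to the grader. The `Trajectory` record, its field names, and the 8-word overlap threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    """Hypothetical log record for one RL episode (field names are assumptions)."""
    episode_id: str
    cot: str           # the model's chain of thought
    grader_input: str  # the exact text the grader was shown


def word_ngrams(text: str, n: int) -> set:
    """Return the set of all n-word runs in text, lowercased."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def cot_exposed(traj: Trajectory, n: int = 8) -> bool:
    """Flag an episode if any n-word run from the CoT also appears in the grader input."""
    cot_grams = word_ngrams(traj.cot, n)
    if not cot_grams:
        # CoT shorter than n words: too little text to judge with this heuristic.
        return False
    return not cot_grams.isdisjoint(word_ngrams(traj.grader_input, n))


def exposure_rate(trajs: list, n: int = 8) -> float:
    """Fraction of episodes flagged as exposing CoT to the grader."""
    if not trajs:
        return 0.0
    return sum(cot_exposed(t, n) for t in trajs) / len(trajs)
```

A check like this would over-flag episodes where the final answer legitimately repeats long phrases from the CoT, so flagged episodes would still need review before being counted as real exposure.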

Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Scientist, AI Ethicist, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Redwood Research blog.