Reasoning Models Struggle to Control their Chains of Thought

2026-01-11 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, AI Safety & Alignment · Depth: Expert, extended

Summary

A new evaluation suite, CoT-Control, measures how well reasoning models can control what appears in their Chain-of-Thought (CoT). This matters because CoT monitoring is used to detect potential misbehavior and understand model motivations, and a model that can control its CoT could undermine that monitoring. The suite includes 14,076 problems drawn from benchmarks such as GPQA and MMLU-Pro; each requires the model to solve a task while following an instruction about its CoT, such as avoiding specific words. The researchers found that reasoning models control their CoT far less reliably than their final output (e.g., Claude Sonnet 4.5 at 2.7% versus 61.9%). CoT controllability increases with model size but decreases with more Reinforcement Learning (RL) training, more test-time compute, and higher problem difficulty. It remains low even with incentives or adversarial prompting, which suggests current models are unlikely to intentionally evade CoT monitors, though the underlying mechanism is not fully understood.

Key takeaway

For research scientists developing or deploying advanced AI agents, understanding CoT controllability is crucial for ensuring model safety and monitorability. Your models' low CoT controllability is a positive indicator for robust oversight, as it suggests difficulty in intentionally obfuscating internal reasoning. You should actively track this metric in future models and explore training and inference strategies that maintain or further reduce CoT controllability, especially as models scale in size and capability, to preserve the integrity of CoT-based safety mechanisms.

Key insights

Reasoning models currently struggle to control their internal Chain-of-Thought, making them less likely to intentionally evade monitoring.

Principles

Higher CoT controllability implies lower monitorability, since a model that can shape its reasoning on instruction could also hide it from a monitor.
Larger models exhibit higher CoT controllability.
RL training and increased reasoning effort reduce CoT controllability.

Method

The CoT-Control suite evaluates models by requiring them to solve problems while adhering to explicit CoT constraints (e.g., word suppression, stylistic edits) and scoring compliance programmatically.
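
As a rough illustration of what programmatic compliance scoring can look like for a word-suppression constraint, here is a minimal Python sketch. It is not the CoT-Control implementation: the Sample schema, the function names, and the whole-word, case-insensitive matching rule are all assumptions made for this example.

```python
import re
from dataclasses import dataclass


@dataclass
class Sample:
    """One model response (hypothetical schema, not the suite's format)."""
    cot: str        # the model's chain-of-thought text
    answer: str     # the model's final answer
    expected: str   # the reference answer


def violates_word_ban(text: str, banned: list[str]) -> bool:
    """Return True if any banned word occurs as a whole word (case-insensitive)."""
    for word in banned:
        if re.search(rf"\b{re.escape(word)}\b", text, flags=re.IGNORECASE):
            return True
    return False


def score(sample: Sample, banned: list[str]) -> dict[str, bool]:
    """Score constraint compliance in the CoT separately from task correctness."""
    return {
        "cot_compliant": not violates_word_ban(sample.cot, banned),
        "correct": sample.answer.strip() == sample.expected.strip(),
    }
```

Keeping compliance and correctness as separate flags makes it possible to tell apart a model that ignores the constraint from one that follows it but gets the task wrong.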

In practice

Track CoT controllability in new models.
Prioritize training methods that reduce CoT controllability.
Consider scaling capability through test-time compute, which the study links to lower CoT controllability.
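
One way to act on the tracking recommendation is to record a per-model compliance rate and flag any model whose CoT controllability rises noticeably above a baseline. The Python sketch below assumes per-sample compliance flags are already available; the function names, the tolerance value, and the idea of a single baseline model are illustrative assumptions, not something the paper prescribes.

```python
def compliance_rate(flags: list[bool]) -> float:
    """Fraction of samples whose CoT satisfied the constraint."""
    if not flags:
        raise ValueError("need at least one sample")
    return sum(flags) / len(flags)


def flag_increases(rates: dict[str, float], baseline: str,
                   tolerance: float = 0.05) -> list[str]:
    """Return models whose CoT controllability exceeds the baseline's
    by more than `tolerance` (an arbitrary threshold chosen for this sketch)."""
    base = rates[baseline]
    return [
        name for name, rate in rates.items()
        if name != baseline and rate - base > tolerance
    ]
```

A rising rate is the direction to watch here, because higher CoT controllability means a model is better able to shape what a monitor sees.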

Topics

Chain-of-Thought
CoT Controllability
AI Safety
Reinforcement Learning
Large Language Model Evaluation

Code references

YuehHanChen/CoTControl

Best for: Research Scientist, AI Researcher, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.