Fine-tuning experiments on CoT controllability
Summary
Fine-tuning experiments show that a small amount of instruction-following training substantially increases Chain-of-Thought (CoT) controllability in reasoning models. Researchers fine-tuned four models (GPT-OSS-20B, GPT-OSS-120B, Qwen-3-8B, and Qwen-3-32B) on small datasets of 240 examples, or approximately 100K-300K tokens. Average out-of-distribution (OOD) CoT controllability on the CoTControl eval suite rose from 2.9% to 8.8%. The largest improvements were for instructions requiring a specific casing in the reasoning, suppression of particular words, and inclusion of provided sentences. Absolute compliance remains low, but the result suggests that even minimal optimization pressure can raise CoT controllability, so the currently low controllability may not be robust to inadvertent training influences. The study used the ReasonIF and CoTControl eval suites, adjusting prompts for consistency and penalization.
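To put the headline numbers in perspective, the short calculation below separates the absolute gain (in percentage points) from the relative gain (as a multiple). Only the two percentages come from the summary above; the variable names are illustrative.

```python
# Average OOD CoT controllability on CoTControl, as reported in the summary (%).
before_pct = 2.9
after_pct = 8.8

# Absolute gain is measured in percentage points, not percent.
absolute_gain_pp = after_pct - before_pct

# Relative gain is the ratio of the two rates.
relative_gain = after_pct / before_pct

print(f"Absolute gain: {absolute_gain_pp:.1f} percentage points")  # 5.9
print(f"Relative gain: {relative_gain:.2f}x")                      # 3.03x
```

The relative figure (about threefold) is why the change reads as large, while the absolute figure (under 6 points) is why overall compliance is still described as low.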
Key takeaway
AI ethicists and ML engineers concerned with model safety should recognize that even minor fine-tuning can substantially increase Chain-of-Thought controllability. This implies that models could inadvertently acquire the ability to evade CoT monitors, for example by omitting critical information from their reasoning traces. Proactively stress-test your models for CoT controllability and consider its implications for monitorability, especially as models scale.
Key insights
A small amount of fine-tuning substantially boosts CoT controllability, suggesting that the currently low controllability is fragile under optimization pressure.
Principles
- CoT controllability is a latent model capability.
- Minimal fine-tuning can meaningfully increase control.
- Controllability gains vary by instruction type.
Method
Fine-tune models on instruction-following reasoning data, then evaluate generalization on out-of-distribution tasks by editing rollouts for compliance.
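As a rough illustration of what scoring compliance could look like for the three instruction types named in the summary (reasoning casing, word suppression, sentence inclusion), here is a minimal Python sketch. These checkers and their names are assumptions made for illustration; they are not the graders used by CoTControl or ReasonIF, which may define compliance differently.

```python
import re
from typing import Callable, Sequence


def complies_uppercase(reasoning: str) -> bool:
    """Hypothetical check: the trace has letters and none of them are lowercase."""
    has_letters = any(c.isalpha() for c in reasoning)
    return has_letters and reasoning == reasoning.upper()


def complies_word_suppression(reasoning: str, banned_word: str) -> bool:
    """Hypothetical check: the banned word never appears as a whole word.

    Matching is case-insensitive and assumes banned_word is a plain word
    (word boundaries behave differently around punctuation).
    """
    pattern = r"\b" + re.escape(banned_word) + r"\b"
    return re.search(pattern, reasoning, flags=re.IGNORECASE) is None


def complies_sentence_inclusion(reasoning: str, sentence: str) -> bool:
    """Hypothetical check: the provided sentence appears verbatim in the trace."""
    return sentence in reasoning


def compliance_rate(traces: Sequence[str], check: Callable[[str], bool]) -> float:
    """Fraction of reasoning traces that pass the given check."""
    if not traces:
        return 0.0
    return sum(1 for trace in traces if check(trace)) / len(traces)


# Toy traces, invented for this example only.
traces = ["THE ANSWER IS 4.", "The answer is 4."]
uppercase_rate = compliance_rate(traces, complies_uppercase)
print(uppercase_rate)  # 0.5
```

Running the same checks on a model before and after fine-tuning, on instructions it was not trained on, would give a per-instruction-type picture of how far the gains generalize.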
In practice
- Apply instruction-following fine-tuning for CoT control.
- Target specific instruction types for best gains.
- Monitor CoT for accidental controllability increases.
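One way to act on the monitoring point is to compare per-instruction-type compliance rates between a baseline checkpoint and a newly trained one, and flag any type whose rate rose by more than a chosen margin. The sketch below assumes compliance rates have already been measured; the function name, the 2-point threshold, and the example rates are all placeholders, not values from the study.

```python
from typing import Dict


def flag_controllability_increases(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    min_gain_pp: float = 2.0,  # assumed threshold, in percentage points
) -> Dict[str, float]:
    """Return instruction types whose compliance rose by at least min_gain_pp.

    Both inputs map an instruction type to a compliance rate in percent.
    Types missing from the candidate measurements are skipped.
    """
    flagged = {}
    for instruction_type, base_rate in baseline.items():
        if instruction_type not in candidate:
            continue
        gain = candidate[instruction_type] - base_rate
        if gain >= min_gain_pp:
            flagged[instruction_type] = round(gain, 2)
    return flagged


# Placeholder rates (percent), invented purely to exercise the function.
baseline_rates = {"uppercase": 3.0, "word_suppression": 2.0, "sentence_inclusion": 4.0}
candidate_rates = {"uppercase": 9.0, "word_suppression": 2.5, "sentence_inclusion": 4.0}

flagged = flag_controllability_increases(baseline_rates, candidate_rates)
print(flagged)  # {'uppercase': 6.0}
```

A flagged type does not by itself mean monitorability has degraded; it is a prompt to look more closely at what the training run changed.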
Topics
- CoT Controllability
- Fine-tuning
- Instruction Following
- Reasoning Models
- Model Safety
- AI Evaluation
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Ethicist
Editorial summary, takeaway, and curation by AIssential. Original article published by METR.