Aligned, Orthogonal or In-conflict: When can we safely optimize Chain-of-Thought?
Summary
A new conceptual framework predicts how training affects the monitorability of a large language model's (LLM) Chain-of-Thought (CoT). The framework models LLM post-training as a Reinforcement Learning (RL) environment whose reward function is decomposed into two terms: one for the final output and one for the CoT. Before training, the terms are classified as "aligned," "orthogonal," or "in-conflict." The framework predicts that "in-conflict" terms will reduce CoT monitorability, "orthogonal" terms will leave it unchanged, and "aligned" terms will improve it. Empirical validation across several RL environments confirmed that training with "in-conflict" reward terms reduces CoT monitorability and that such terms are difficult to optimize.
Key takeaway
For research scientists developing and deploying LLMs, understanding how reward function components relate to CoT monitorability is crucial. If your training objective includes both final-output and CoT-based rewards, classify their relationship with this framework before training. Prioritize "aligned" reward structures, which are predicted to improve CoT monitorability, or "orthogonal" ones, which are predicted to leave it unchanged. "In-conflict" terms will likely degrade oversight capabilities and complicate optimization.
Key insights
A framework predicts how the relationship between output and CoT reward terms (aligned, orthogonal, or in-conflict) affects LLM Chain-of-Thought monitorability during training.
Principles
- Reward terms can be aligned, orthogonal, or in-conflict.
- In-conflict reward terms reduce CoT monitorability.
- Optimizing in-conflict reward terms is difficult.
Method
The method models LLM post-training as an RL environment, decomposing rewards into output and CoT terms, then classifying these terms to predict monitorability changes.
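The decomposition and the three predictions can be written down as a minimal Python sketch. Everything named here (`Relation`, `DecomposedReward`, the toy reward functions) is hypothetical and chosen for illustration. Combining the two terms by simple addition is an assumption, and the relation label is supplied by hand rather than computed, since this summary does not describe the classification procedure.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Relation(Enum):
    ALIGNED = "aligned"
    ORTHOGONAL = "orthogonal"
    IN_CONFLICT = "in-conflict"


# Predicted effect on CoT monitorability, as stated in the summary.
PREDICTED_EFFECT = {
    Relation.ALIGNED: "improves",
    Relation.ORTHOGONAL: "no effect",
    Relation.IN_CONFLICT: "reduces",
}


@dataclass(frozen=True)
class DecomposedReward:
    """A reward split into a final-output term and a CoT term."""

    output_term: Callable[[str], float]  # scores the final output only
    cot_term: Callable[[str], float]     # scores the chain-of-thought only
    relation: Relation                   # label assigned before training

    def __call__(self, cot: str, output: str) -> float:
        # Assumption: the two terms are combined by addition.
        return self.output_term(output) + self.cot_term(cot)

    def predicted_monitorability_effect(self) -> str:
        return PREDICTED_EFFECT[self.relation]


# Toy terms, invented for this sketch.
def exact_match(output: str) -> float:
    return 1.0 if output.strip() == "42" else 0.0


def word_count_penalty(cot: str) -> float:
    return -0.01 * len(cot.split())


# The label below is arbitrary; it is not a claim about this reward pair.
reward = DecomposedReward(exact_match, word_count_penalty, Relation.IN_CONFLICT)
score = reward(cot="six times seven", output="42")
effect = reward.predicted_monitorability_effect()
```

Keeping the label next to the reward terms makes the prediction available before any training run starts, which is the point at which the framework is meant to be applied.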
In practice
- Classify reward terms before LLM training.
- Avoid training with "in-conflict" reward terms.
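The two steps above could be enforced as a pre-training check. The sketch below assumes each CoT reward term has already been labeled by hand. The function names and the configuration format are hypothetical and do not come from the original article.

```python
from typing import List, Mapping

ALLOWED_LABELS = {"aligned", "orthogonal", "in-conflict"}


def find_in_conflict_terms(labels: Mapping[str, str]) -> List[str]:
    """Return the names of reward terms labeled "in-conflict", sorted."""
    unknown = {name: label for name, label in labels.items()
               if label not in ALLOWED_LABELS}
    if unknown:
        raise ValueError(f"Unrecognized labels: {unknown}")
    return sorted(name for name, label in labels.items()
                  if label == "in-conflict")


def assert_safe_to_train(labels: Mapping[str, str]) -> None:
    """Refuse to proceed if any reward term is labeled "in-conflict"."""
    flagged = find_in_conflict_terms(labels)
    if flagged:
        raise RuntimeError(
            "In-conflict reward terms are predicted to reduce CoT "
            f"monitorability: {flagged}"
        )


# Hypothetical configuration: term name -> label assigned before training.
example_labels = {
    "cot_term_a": "aligned",
    "cot_term_b": "orthogonal",
}
assert_safe_to_train(example_labels)  # passes: nothing is in-conflict
```

Raising an error rather than logging a warning is a design choice for this sketch; a team that deliberately studies in-conflict rewards would want an explicit override instead.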
Topics
- Chain-of-Thought
- LLM Monitoring
- Reinforcement Learning
- Reward Alignment
- AI Oversight
Best for: Research Scientist, AI Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.