How will we do SFT on models with opaque reasoning?
Summary
This analysis explores the challenges of applying Supervised Fine-Tuning (SFT) to future AI models that may exhibit opaque reasoning, in contrast to current Large Language Models (LLMs), which often externalize their reasoning in human-interpretable language. The author speculates on potential forms of opaque reasoning, such as chain-of-thought (CoT) drifting into alien languages after extensive Reinforcement Learning (RL), recurrent neural networks not initialized from language priors, or text diffusion models. The post evaluates six SFT techniques for opaque reasoning models, rating their effectiveness from "really bad" (🔴) to "promising" (🟢). Techniques like SFT with an empty reasoning trace or training reasoning encoders/decoders are deemed ineffective, while generating "almost on-policy" SFT labels by editing model outputs is considered promising. The discussion also highlights that opaque reasoning may be differentially detrimental to control-focused training methods compared to general capabilities.
Key takeaway
For research scientists developing AI control mechanisms, a potential shift to opaque reasoning in future models calls for re-evaluating SFT strategies. Prioritize exploring methods such as editing "almost on-policy" model completions for SFT, or abandoning SFT entirely in favor of alternative training-based control techniques, especially for models whose reasoning must be treated as opaque. This adjustment matters for building control methods that stay robust as AI systems grow more complex.
Key insights
Opaque reasoning in future AI models poses significant challenges for current SFT-based control methods.
Principles
- Opaque reasoning may arise from RL.
- SFT effectiveness varies greatly with reasoning opacity.
- SFT on off-policy data can degrade model performance.
Method
The article evaluates SFT techniques for opaque reasoning, including normal SFT, SFT with empty reasoning traces, generating off-policy or "almost on-policy" labels, and training reasoning encoders/decoders.
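To make two of these variants concrete, here is a minimal sketch of how training examples might be assembled for SFT with an empty reasoning trace and for "almost on-policy" labels. It is an illustration under stated assumptions, not the original post's implementation: the `SFTExample` type, the helper names, the choice to exclude the reasoning span from the loss, and the per-character mask (a stand-in for a per-token mask) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SFTExample:
    """One training example. All field names here are hypothetical."""
    prompt: str
    reasoning: str            # treated as an opaque blob; may be empty
    answer: str
    train_on_reasoning: bool  # whether the loss covers the reasoning span


def empty_trace_example(prompt: str, label: str) -> SFTExample:
    """SFT with an empty reasoning trace: supervise only the answer."""
    return SFTExample(prompt, "", label, train_on_reasoning=False)


def almost_on_policy_example(
    prompt: str,
    sampled_reasoning: str,
    sampled_answer: str,
    edit_answer: Callable[[str], str],
) -> SFTExample:
    """Keep the model's own (opaque) reasoning untouched and edit only the answer.

    Assumption: the loss is applied to the edited answer, not the reasoning.
    """
    return SFTExample(
        prompt,
        sampled_reasoning,
        edit_answer(sampled_answer),
        train_on_reasoning=False,
    )


def loss_mask(ex: SFTExample) -> List[int]:
    """Per-character mask over reasoning + answer (1 = contributes to the loss).

    Characters stand in for tokens to keep the sketch dependency-free.
    """
    reasoning_flag = int(ex.train_on_reasoning)
    return [reasoning_flag] * len(ex.reasoning) + [1] * len(ex.answer)
```

In this sketch, `almost_on_policy_example("Q", "zx", "bad", lambda a: "ok")` keeps the sampled reasoning `"zx"` verbatim and supervises only the edited answer, so the training data stays close to what the model would have produced on its own.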
In practice
- Consider editing model outputs for "almost on-policy" SFT.
- Explore non-SFT alternatives for training-based control.
- Focus research on non-reasoning malign inits.
Topics
- Opaque Reasoning
- Supervised Fine-Tuning
- AI Alignment
- Training-based Control
- Large Language Models
Best for: Research Scientist, AI Researcher, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by AI Alignment Forum.