How will we do SFT on models with opaque reasoning?

Source: AI Alignment Forum · Field: Technology & Digital: Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Advanced, long

Summary

This analysis explores the challenges of applying Supervised Fine-Tuning (SFT) to future AI models that may exhibit opaque reasoning, contrasting with current Large Language Models (LLMs) that often externalize reasoning in human-interpretable language. The author speculates on potential forms of opaque reasoning, such as CoT drifting into alien languages after extensive Reinforcement Learning (RL), recurrent neural networks not initialized from language priors, or text diffusion models. The post evaluates six SFT techniques for opaque reasoning models, rating their effectiveness from "really bad" (🔴) to "promising" (🟢). Techniques like SFT with an empty reasoning trace or training reasoning encoders/decoders are deemed ineffective, while generating "almost on-policy" SFT labels by editing model outputs is considered promising. The discussion also highlights the potential for opaque reasoning to be differentially detrimental to control-focused training methods compared to general capabilities.

Key takeaway

For research scientists developing AI control mechanisms, a potential shift to opaque reasoning in future models calls for re-evaluating SFT strategies. Prioritize exploring methods such as editing "almost on-policy" model completions to produce SFT labels, or abandoning SFT entirely in favor of alternative training-based control techniques when a model's reasoning must be treated as opaque. Making this adjustment early matters for keeping control methods robust as AI systems grow more complex.

Key insights

Opaque reasoning in future AI models poses significant challenges for current SFT-based control methods.

Method

The article evaluates SFT techniques for opaque reasoning, including normal SFT, SFT with empty reasoning traces, generating off-policy or "almost on-policy" labels, and training reasoning encoders/decoders.
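
As a rough illustration of two of the techniques named above, the sketch below builds token-level SFT examples for (a) an empty reasoning trace and (b) an "almost on-policy" label, where the model's own sampled reasoning is kept and only the final output is edited. This is a minimal sketch under stated assumptions, not the original post's implementation: the function names, the token-ID representation, and the choice to mask the loss on the reasoning span are all hypothetical. The value -100 is used because it is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
from dataclasses import dataclass
from typing import List

# Label value meaning "do not compute loss at this position".
# -100 matches the default ignore_index of PyTorch's CrossEntropyLoss.
IGNORE = -100


@dataclass
class SFTExample:
    input_ids: List[int]  # full token sequence fed to the model
    labels: List[int]     # per-token targets; IGNORE where no loss is taken


def build_example(prompt: List[int], reasoning: List[int],
                  output: List[int]) -> SFTExample:
    """Concatenate prompt, reasoning, and output; take loss on the output only.

    Assumption: the reasoning span is treated as opaque, so it is left
    unsupervised rather than imitated token by token.
    """
    input_ids = list(prompt) + list(reasoning) + list(output)
    labels = ([IGNORE] * len(prompt)
              + [IGNORE] * len(reasoning)
              + list(output))
    return SFTExample(input_ids, labels)


def empty_reasoning_example(prompt: List[int],
                            target_output: List[int]) -> SFTExample:
    """(a) SFT with an empty reasoning trace: no reasoning tokens at all."""
    return build_example(prompt, [], target_output)


def almost_on_policy_example(prompt: List[int],
                             sampled_reasoning: List[int],
                             edited_output: List[int]) -> SFTExample:
    """(b) Keep the model's own sampled reasoning; supervise an edited output."""
    return build_example(prompt, sampled_reasoning, edited_output)


if __name__ == "__main__":
    # Toy token IDs, purely illustrative.
    ex = almost_on_policy_example([1, 2], [10, 11, 12], [20, 21])
    print(ex.input_ids)
    print(ex.labels)
```

In variant (b) the context the model conditions on stays close to its own distribution, which is one plausible reading of why the summary rates this family of techniques more favorably than training with an empty trace.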

Best for: Research Scientist, AI Researcher, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by AI Alignment Forum.