How will we do SFT on models with opaque reasoning?

2026-02-21 · Source: AI Alignment Forum · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Advanced, long

Summary

This analysis explores the challenges of applying Supervised Fine-Tuning (SFT) to future AI models that may reason opaquely, in contrast to current Large Language Models (LLMs), which often externalize their reasoning in human-interpretable language. The author speculates on the forms opaque reasoning could take: chain-of-thought (CoT) that drifts into an alien language after extensive Reinforcement Learning (RL), recurrent neural networks not initialized from language priors, or text diffusion models. The post evaluates six SFT techniques for opaque-reasoning models, rating each from "really bad" (🔴) to "promising" (🟢). Techniques such as SFT with an empty reasoning trace or training reasoning encoders/decoders are deemed ineffective, while generating "almost on-policy" SFT labels by editing model outputs is considered promising. The discussion also raises the possibility that opaque reasoning harms control-focused training methods more than it harms general capabilities.

Key takeaway

For research scientists developing AI control mechanisms, a potential shift to opaque reasoning in future models calls for re-evaluating SFT strategies. Prioritize exploring methods such as editing "almost on-policy" model completions to produce SFT labels, or consider abandoning SFT entirely in favor of other training-based control techniques when a model's reasoning must be treated as opaque. Making this adjustment matters for keeping control robust against increasingly complex AI systems.

Key insights

Opaque reasoning in future AI models poses significant challenges for current SFT-based control methods.

Principles

Opaque reasoning may arise from RL.
SFT effectiveness varies greatly with reasoning opacity.
Off-policy data degrades model performance.

Method

The article evaluates SFT techniques for opaque reasoning, including normal SFT, SFT with empty reasoning traces, generating off-policy or "almost on-policy" labels, and training reasoning encoders/decoders.
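
To make two of these techniques concrete, here is a minimal Python sketch of how training examples might be assembled for "SFT with an empty reasoning trace" and for "almost on-policy" labels. Everything in it is an illustrative assumption rather than the original post's implementation: the `SFTExample` structure, the function names, the token-list representation, and in particular the choice to mask the loss on the model's own reasoning tokens are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SFTExample:
    tokens: List[str]     # prompt + reasoning + answer, in order
    loss_mask: List[int]  # 1 where the SFT loss is applied, 0 elsewhere


def build_example(prompt: List[str], reasoning: List[str],
                  answer: List[str], train_on_reasoning: bool) -> SFTExample:
    """Concatenate the segments and mark which tokens receive loss."""
    tokens = prompt + reasoning + answer
    mask = ([0] * len(prompt)
            + [int(train_on_reasoning)] * len(reasoning)
            + [1] * len(answer))
    return SFTExample(tokens, mask)


def empty_reasoning_example(prompt: List[str],
                            target_answer: List[str]) -> SFTExample:
    """SFT with an empty reasoning trace: the label skips reasoning entirely."""
    return build_example(prompt, [], target_answer, train_on_reasoning=False)


def almost_on_policy_example(prompt: List[str],
                             sampled_reasoning: List[str],
                             sampled_answer: List[str],
                             edit: Callable[[List[str]], List[str]]) -> SFTExample:
    """Keep the model's own (opaque) reasoning, edit only the visible answer.

    Assumption: the loss is masked on the reasoning tokens, since a
    supervisor cannot write or check them; only the edited answer is trained.
    """
    edited_answer = edit(sampled_answer)
    return build_example(prompt, sampled_reasoning, edited_answer,
                         train_on_reasoning=False)
```

For example, calling `almost_on_policy_example` with a sampled completion and an `edit` function that corrects only the final answer yields a label that stays close to the model's own distribution, whereas `empty_reasoning_example` produces a label the model would rarely generate on its own.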

In practice

Consider editing model outputs for "almost on-policy" SFT.
Explore non-SFT alternatives for training-based control.
Focus research on malign initializations ("malign inits") that do not rely on reasoning.

Topics

Opaque Reasoning
Supervised Fine-Tuning
AI Alignment
Training-based Control
Large Language Models

Best for: Research Scientist, AI Researcher, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by AI Alignment Forum.