Rethinking Generalization in Reasoning SFT: A Conditional Analysis on Optimization, Data, and Model Capability
Summary
A new study challenges the common belief that Supervised Fine-Tuning (SFT) primarily leads to memorization while Reinforcement Learning (RL) fosters generalization in Large Language Models (LLMs). The researchers found that cross-domain generalization in reasoning SFT, particularly with long Chain-of-Thought (CoT) supervision, is conditional: it depends on optimization dynamics, training data quality, and the base model's inherent capability. They observed a "dip-and-recovery" pattern in which cross-domain performance initially degrades and then improves with extended training, so evaluations taken after only a short training run can misrepresent generalization. High-quality, verified long-CoT traces consistently improved cross-domain reasoning, while low-quality solutions were detrimental. Stronger models internalized transferable procedural patterns, even from simple tasks, whereas weaker models merely mimicked surface-level verbosity. The generalization is also asymmetric: reasoning improves, but safety degrades.
Key takeaway
AI Engineers and Research Scientists evaluating LLM post-training strategies should note that reasoning SFT can achieve cross-domain generalization, but only with careful attention to training duration and data quality. Do not conclude that SFT has failed on the basis of early training checkpoints, as performance may recover. Focus on curating high-quality, long Chain-of-Thought data and use more capable base models to foster robust reasoning generalization, while monitoring for safety degradation.
Key insights
Reasoning SFT can generalize across domains, but only conditionally: it depends on optimization, data quality, and model capability.
Principles
- Generalization can exhibit a "dip-and-recovery" pattern.
- High-quality CoT data improves cross-domain reasoning.
- Stronger models internalize transferable reasoning patterns.
In practice
- Extend SFT training to overcome initial performance dips.
- Prioritize verified, long CoT traces for training data.
- Use stronger base models for better reasoning generalization.
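The first practice above, not judging SFT from early checkpoints, can be illustrated with a small sketch. It labels a sequence of cross-domain evaluation scores (one per checkpoint, in training order) relative to the base model's score. The function name, the labels, the tolerance parameter, and the example scores are all hypothetical and chosen only for illustration; they are not taken from the study.

```python
from typing import Sequence


def classify_trajectory(
    scores: Sequence[float], baseline: float, tol: float = 0.0
) -> str:
    """Label cross-domain eval scores taken at successive checkpoints.

    scores   -- one score per checkpoint, in training order
    baseline -- the base model's score before any SFT
    tol      -- slack below the baseline that is not counted as a dip
    """
    if not scores:
        raise ValueError("need at least one checkpoint score")

    dipped = min(scores) < baseline - tol      # fell below baseline at some point
    recovered = scores[-1] >= baseline - tol   # last checkpoint is back at or above it

    if dipped and recovered:
        return "dip-and-recovery"
    if dipped:
        return "still-below-baseline"
    return "no-dip"


if __name__ == "__main__":
    # Made-up illustrative numbers, not results from the paper.
    base_score = 0.40
    checkpoint_scores = [0.40, 0.33, 0.31, 0.38, 0.45]

    # Looking only at the first three checkpoints suggests SFT hurt transfer.
    print(classify_trajectory(checkpoint_scores[:3], base_score))
    # The full run tells a different story.
    print(classify_trajectory(checkpoint_scores, base_score))
```

Under these assumptions, the same run gets a different label depending on where evaluation stops, which is why a short training run can misrepresent generalization.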
Topics
- Reasoning SFT
- Cross-domain Generalization
- Chain-of-Thought
- Optimization Dynamics
- Model Capability
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.