Rethinking Generalization in Reasoning SFT: A Conditional Analysis on Optimization, Data, and Model Capability

2026-04-10 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

A new study challenges the common belief that Supervised Fine-Tuning (SFT) mainly produces memorization while Reinforcement Learning (RL) produces generalization in Large Language Models (LLMs). The researchers found that cross-domain generalization in reasoning SFT, particularly with long Chain-of-Thought (CoT) supervision, is conditional: it depends on optimization dynamics, training data quality, and the base model's capability. On optimization, the study observed a "dip-and-recovery" pattern in which cross-domain performance first degrades and then improves with extended training, so evaluating only short training runs can misrepresent how well SFT generalizes. On data, high-quality, verified long-CoT traces consistently improved cross-domain reasoning, while low-quality solutions hurt it. On capability, stronger models internalized transferable procedural patterns even from simple tasks, whereas weaker models merely imitated surface-level verbosity. The generalization is also asymmetric: reasoning improves while safety degrades.

Key takeaway

For AI engineers and research scientists evaluating LLM post-training strategies: reasoning SFT can achieve cross-domain generalization, but only with sufficient training duration and high-quality data. Do not conclude that SFT has failed on the basis of early checkpoints, because cross-domain performance may recover with further training. Curate high-quality, verified long Chain-of-Thought data, use more capable base models, and monitor safety, which can degrade as reasoning improves.

Key insights

Reasoning SFT can generalize across domains, but only conditionally: the outcome depends on optimization dynamics, data quality, and model capability.

Principles

Generalization can exhibit a "dip-and-recovery" pattern.
High-quality CoT data improves cross-domain reasoning.
Stronger models internalize transferable reasoning patterns.
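
The "dip-and-recovery" principle above can be made concrete with a small checkpoint check. The following is a minimal sketch, not the study's method: the function name `classify_trajectory`, the noise tolerance `tol`, and all scores in the usage note are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def classify_trajectory(
    baseline: float,
    checkpoints: List[Tuple[int, float]],
    tol: float = 0.5,
) -> str:
    """Label a cross-domain score trajectory relative to the base model.

    baseline:    score of the base model before SFT (assumed metric, e.g. accuracy in %).
    checkpoints: (training_step, score) pairs, in any order.
    tol:         differences smaller than this are treated as noise (assumed value).
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")

    # Sort by training step so the last entry is the latest checkpoint.
    scores = [score for _, score in sorted(checkpoints)]
    lowest, final = min(scores), scores[-1]

    dipped = lowest < baseline - tol      # fell below the base model at some point
    recovered = final > baseline + tol    # ended above the base model

    if dipped and recovered:
        return "dip-and-recovery"
    if recovered:
        return "improved"
    if final < baseline - tol:
        return "degraded"
    return "flat"
```

With made-up scores, `classify_trajectory(50.0, [(100, 45.0)])` returns "degraded", while the longer run `classify_trajectory(50.0, [(100, 45.0), (200, 48.0), (400, 55.0)])` returns "dip-and-recovery". The same early checkpoint leads to opposite conclusions depending on how long training is observed.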

In practice

Extend SFT training to overcome initial performance dips.
Prioritize verified, long CoT traces for training data.
Use stronger base models for better reasoning generalization.
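
The data-curation advice above could be sketched as a simple filter. This is an illustration under stated assumptions, not the study's pipeline: the `Trace` fields, the exact-match verification against a reference answer, and the `min_reasoning_words` threshold are all hypothetical, and the summary does not say how the authors verified traces or defined "long".

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Trace:
    question: str
    reasoning: str   # chain-of-thought text
    answer: str      # final answer extracted from the trace
    reference: str   # known-correct answer used for verification


def select_traces(traces: Iterable[Trace], min_reasoning_words: int = 200) -> List[Trace]:
    """Keep traces whose answer matches the reference and whose reasoning is long.

    Exact string match is an assumed stand-in for verification; the word-count
    threshold is an assumed proxy for "long" CoT.
    """
    kept = []
    for trace in traces:
        verified = trace.answer.strip() == trace.reference.strip()
        long_enough = len(trace.reasoning.split()) >= min_reasoning_words
        if verified and long_enough:
            kept.append(trace)
    return kept
```

In a real pipeline, a task-specific checker (for example a math answer normalizer or unit tests for code) would likely replace the exact-match comparison.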

Topics

Reasoning SFT
Cross-domain Generalization
Chain-of-Thought
Optimization Dynamics
Model Capability

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.