Why Does RL Generalize Better Than SFT? A Data-Centric Perspective on VLM Post-Training
Summary
Large-scale Vision-Language Models (VLMs) post-trained with Reinforcement Learning (RL) exhibit superior out-of-distribution (OOD) generalization compared to those trained with Supervised Fine-Tuning (SFT). This phenomenon, reported on February 11, 2026, is attributed to RL's implicit data filtering, which prioritizes medium-difficulty training samples. To test this explanation, the authors systematically evaluated SFT models trained on data of varying difficulty and confirmed that training on hard samples significantly degrades OOD performance. Based on this finding, they introduce Difficulty-Curated SFT (DC-SFT), which explicitly filters training data by sample difficulty. DC-SFT substantially improves OOD generalization over standard SFT and even surpasses RL-based training, while offering better stability and computational efficiency. Code for DC-SFT is available on GitHub.
Key takeaway
If you are a research scientist or VLM engineer optimizing model generalization, consider adding Difficulty-Curated SFT (DC-SFT) to your post-training pipeline. The method explicitly filters training data by difficulty and is reported to be a more stable and computationally efficient route to strong out-of-distribution performance than standard SFT or even RL-based approaches. As a first step, measure the difficulty distribution of your training data, then curate it to prioritize medium-difficulty samples.
Key insights
RL's OOD generalization advantage in VLMs stems from implicitly prioritizing medium-difficulty training data.
Principles
- Hard training samples degrade OOD performance.
- Data difficulty is a critical generalization factor.
Method
Difficulty-Curated SFT (DC-SFT) explicitly filters VLM training data by sample difficulty to improve OOD generalization. It is reported to outperform both standard SFT and RL-based training on OOD generalization.
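The summary does not say how DC-SFT measures difficulty or where it draws its cut-offs, so the following is only a minimal sketch of difficulty-based curation under stated assumptions. It assumes difficulty can be approximated by the failure rate of a reference model over several sampled attempts, and it keeps samples inside a medium band. The `Sample` class, the `attempt_correct` field, and the `low`/`high` thresholds are hypothetical and are not taken from the paper or its code.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Sample:
    """Hypothetical training sample with graded attempts from a reference model."""
    sample_id: str
    # 1 if a sampled answer was graded correct, 0 otherwise.
    # How attempts are generated and graded is an assumption of this sketch.
    attempt_correct: Sequence[int]


def difficulty(sample: Sample) -> float:
    """Assumed proxy: fraction of failed attempts (0.0 = easy, 1.0 = hard)."""
    if not sample.attempt_correct:
        raise ValueError(f"sample {sample.sample_id!r} has no attempts")
    return 1.0 - sum(sample.attempt_correct) / len(sample.attempt_correct)


def curate_by_difficulty(
    samples: Sequence[Sample], low: float = 0.25, high: float = 0.75
) -> List[Sample]:
    """Keep only samples whose difficulty lies in the band [low, high].

    The band limits are illustrative defaults, not values from the paper.
    """
    return [s for s in samples if low <= difficulty(s) <= high]


samples = [
    Sample("easy", [1, 1, 1, 1]),    # difficulty 0.0, dropped
    Sample("medium", [1, 0, 1, 0]),  # difficulty 0.5, kept
    Sample("hard", [0, 0, 0, 0]),    # difficulty 1.0, dropped
]
kept = curate_by_difficulty(samples)
print([s.sample_id for s in kept])  # prints ['medium']
```

The curated list would then be passed to an ordinary SFT run in place of the full dataset. Whether easy samples should be dropped as well as hard ones is a design choice this sketch makes for illustration; the summary only states that hard samples hurt OOD performance.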
In practice
- Filter training data by difficulty for better VLM generalization.
- Prioritize medium-difficulty samples in VLM fine-tuning.
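Before filtering, it can help to see how a dataset splits across difficulty levels. The sketch below assumes each sample already has a difficulty score in [0, 1] (for example, a failure rate from a reference model) and reports the share of easy, medium, and hard samples. The bucket boundaries and function names are hypothetical and chosen only for illustration.

```python
from collections import Counter
from typing import Dict, Iterable


def bucket(score: float, low: float = 0.25, high: float = 0.75) -> str:
    """Map a difficulty score in [0, 1] to a coarse label (assumed boundaries)."""
    if score < low:
        return "easy"
    if score > high:
        return "hard"
    return "medium"


def difficulty_distribution(scores: Iterable[float]) -> Dict[str, float]:
    """Return the fraction of samples that fall into each difficulty bucket."""
    counts = Counter(bucket(s) for s in scores)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no difficulty scores provided")
    return {name: counts[name] / total for name in ("easy", "medium", "hard")}


# Illustrative scores only; real scores would come from your own difficulty estimate.
scores = [0.0, 0.1, 0.5, 0.5, 0.6, 0.9, 1.0, 1.0]
dist = difficulty_distribution(scores)
print(dist)  # prints {'easy': 0.25, 'medium': 0.375, 'hard': 0.375}
```

A large hard fraction is the case the findings above flag as risky for OOD generalization, so that share is the one to watch when deciding how aggressively to curate.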
Topics
- Vision-Language Models
- Reinforcement Learning
- Supervised Fine-Tuning
- OOD Generalization
- Data Filtering
Best for: Computer Vision Engineer, Research Scientist, AI Researcher, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.