Stage-1 Controls the Entropy Regime, Not the Outcome
Summary
A study of two-stage post-training for vision-language models (VLMs) on Qwen2.5-VL-7B found that the Stage-1 warm-start, whether supervised fine-tuning (SFT) or on-policy distillation (OPD), mainly determines the policy's "entropy regime" rather than final performance. Three different warm-starts all landed in a narrow 53-54% band on Geometry3K internal validation, so the in-domain endpoint barely moved. Out of domain, an early-stopped SFT improved MathVista by +2.1 points, reversing the 9.5-point drop seen with an over-trained variant. The largest difference between warm-starts was that OPD entered reinforcement learning (RL) with substantially higher policy entropy than SFT. OPD initially showed higher answer diversity and higher in-domain pass@16 (+2.0 to +5.2 points), but this advantage disappeared after RL: endpoint pass@16 values were within 1.1 points and MathVista scores within 1.2 points across models.
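The summary does not say how pass@16 was computed. As a point of reference, here is a minimal sketch of the widely used unbiased pass@k estimator, assuming n sampled answers per problem of which c are correct. The function names `pass_at_k` and `mean_pass_at_k` are illustrative and not taken from the study.

```python
from math import comb
from typing import Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the probability that a random
    # size-k subset of the n samples contains no correct answer.
    # math.comb returns 0 when k > n - c, so the result is then 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: Sequence[int], n: int, k: int) -> float:
    """Average pass@k over problems, given the correct count for each problem."""
    if not correct_counts:
        raise ValueError("correct_counts must not be empty")
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)
```

When n equals k (for example, 16 samples and pass@16), the estimate reduces to whether at least one sample is correct, which is why higher answer diversity before RL can raise pass@16 without raising single-sample accuracy.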
Key takeaway
For machine learning engineers optimizing VLM post-training: the choice of Stage-1 warm-start, SFT or OPD, mainly sets the policy's entropy regime and does little to change final in-domain performance. Prioritize early stopping for SFT, since it can improve out-of-domain generalization while over-training can severely degrade it. Do not assume that the higher initial policy entropy from OPD translates into better results after reinforcement learning.
Key insights
Stage-1 warm-starts in VLM post-training primarily control policy entropy, not the final performance outcome.
Principles
- Stage-1 warm-starts yield similar in-domain endpoints (53-54% on Geometry3K internal validation in the study).
- Early stopping SFT can improve out-of-domain performance.
- Higher policy entropy from OPD doesn't guarantee better RL outcomes.
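One way to act on the early-stopping principle is to choose the SFT checkpoint by out-of-domain score once in-domain accuracy is close to its best. The sketch below is an assumption about how such a rule might look, not the study's procedure; `Checkpoint`, `select_checkpoint`, and the `tolerance` parameter are hypothetical names.

```python
from typing import NamedTuple, Sequence


class Checkpoint(NamedTuple):
    step: int              # training step at which the checkpoint was saved
    in_domain: float       # in-domain validation accuracy, in percent
    out_of_domain: float   # out-of-domain accuracy, in percent


def select_checkpoint(ckpts: Sequence[Checkpoint], tolerance: float = 1.0) -> Checkpoint:
    """Pick the checkpoint with the best out-of-domain score among those whose
    in-domain score is within `tolerance` points of the best in-domain score.
    Ties on out-of-domain score go to the earlier checkpoint."""
    if not ckpts:
        raise ValueError("ckpts must not be empty")
    best_in = max(c.in_domain for c in ckpts)
    eligible = [c for c in ckpts if best_in - c.in_domain <= tolerance]
    return max(eligible, key=lambda c: (c.out_of_domain, -c.step))
```

The tolerance encodes the observation that in-domain endpoints sit in a narrow band, so a small in-domain concession may be a reasonable trade for out-of-domain performance; the right value is a judgment call for each setup.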
In practice
- Consider early stopping SFT for VLM fine-tuning.
- Monitor policy entropy during Stage-1 warm-starts.
- Evaluate Stage-1 impact on out-of-domain tasks.
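To make entropy monitoring concrete, the following is a minimal sketch that computes mean token-level entropy from raw next-token logits. It assumes the logits for each generated token are available as plain lists of floats; in a real pipeline they would come from the model's forward pass. The names `token_entropy` and `mean_policy_entropy` are illustrative.

```python
import math
from typing import Sequence


def token_entropy(logits: Sequence[float]) -> float:
    """Shannon entropy (in nats) of the softmax distribution over one token's logits."""
    m = max(logits)                              # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    # Skip terms that underflowed to zero; their contribution p*log(p) tends to 0.
    return -sum((e / z) * math.log(e / z) for e in exps if e > 0.0)


def mean_policy_entropy(sequences: Sequence[Sequence[Sequence[float]]]) -> float:
    """Average token entropy over all tokens of all sampled sequences.

    `sequences[i][t]` holds the logits for token t of sampled response i.
    """
    entropies = [token_entropy(step) for seq in sequences for step in seq]
    if not entropies:
        raise ValueError("no tokens to average over")
    return sum(entropies) / len(entropies)
```

Logging this value on a fixed prompt set at the end of Stage-1 and during RL gives a simple way to compare the entropy regimes of different warm-starts.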
Topics
- Vision-Language Models
- Post-training
- Supervised Fine-tuning
- On-Policy Distillation
- Reinforcement Learning
- Policy Entropy
- Qwen2.5-VL-7B
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.