Stage-1 Controls the Entropy Regime, Not the Outcome

2026-06-08 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision & Pattern Recognition · Depth: Expert, quick

Summary

A study of two-stage post-training for vision-language models (VLMs), run on Qwen2.5-VL-7B, found that the Stage-1 warm-start, whether supervised fine-tuning (SFT) or on-policy distillation (OPD), mainly sets the "entropy regime" in which reinforcement learning (RL) begins rather than the final performance. Three different warm-starts all finished in a narrow 53-54% band on Geometry3K internal validation, so the in-domain endpoint barely moved. Out of domain, an early-stopped SFT improved MathVista by +2.1 points, reversing a -9.5-point drop from an over-trained variant. The largest difference was that OPD entered RL with substantially higher policy entropy than SFT. OPD initially showed higher answer diversity and higher in-domain pass@16 (+2.0 to +5.2 points), but the advantage was gone after RL: endpoint pass@16 values were within 1.1 points of each other and MathVista scores within 1.2 points across models.
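The summary reports pass@16 but does not say how it was computed. A common choice is the unbiased estimator: with n sampled answers per problem, of which c are correct, pass@k = 1 - C(n-c, k) / C(n, k). The sketch below assumes that estimator; the function names are illustrative and not taken from the study.

```python
from math import comb
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of sampled answers, c: number of correct ones,
    k: budget being scored (k <= n).
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k answers are wrong,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: Iterable[Tuple[int, int]], k: int) -> float:
    """Average pass@k over problems given as (n, c) pairs."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    if not scores:
        raise ValueError("no problems given")
    return sum(scores) / len(scores)
```

With exactly 16 samples per problem, pass@16 reduces to "at least one of the 16 answers is correct", which is why it is sensitive to answer diversity before RL.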

Key takeaway

For machine learning engineers optimizing VLM post-training: the choice of Stage-1 warm-start, SFT or OPD, mainly sets the policy's entropy regime and does little to change final in-domain performance. Prioritize early stopping for SFT, since it can improve out-of-domain generalization while over-training can severely degrade it. Do not assume that the higher initial policy entropy from OPD automatically translates into better results after reinforcement learning.
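The source does not describe how its early-stopped SFT checkpoint was chosen. One generic way to act on the advice is patience-based selection on a held-out score, sketched below. The function name, the `patience` and `min_delta` parameters, and the idea of scoring each checkpoint on a held-out (ideally out-of-domain) set are assumptions made for illustration.

```python
from typing import Optional, Sequence


def early_stop_index(
    scores: Sequence[float],
    patience: int = 2,
    min_delta: float = 0.0,
) -> Optional[int]:
    """Pick a checkpoint by patience-based early stopping.

    scores[i] is the held-out score of checkpoint i (higher is better).
    Scanning stops once `patience` consecutive checkpoints fail to beat
    the best score by more than `min_delta`; the best index seen so far
    is returned. Returns None when no scores are given.
    """
    if not scores:
        return None
    best_idx = 0
    for i in range(1, len(scores)):
        if scores[i] > scores[best_idx] + min_delta:
            best_idx = i
        elif i - best_idx >= patience:
            break  # no improvement for `patience` checkpoints
    return best_idx
```

A small `patience` stops SFT sooner and keeps the policy closer to the base model; a large one risks the over-trained regime the summary warns about.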

Key insights

Stage-1 warm-starts in VLM post-training primarily control policy entropy, not the final performance outcome.

Principles

Stage-1 warm-starts yield similar in-domain endpoints.
Early stopping SFT can improve out-of-domain performance.
Higher policy entropy from OPD doesn't guarantee better RL outcomes.

In practice

Consider early stopping SFT for VLM fine-tuning.
Monitor policy entropy during Stage-1 warm-starts.
Evaluate Stage-1 impact on out-of-domain tasks.
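To make "monitor policy entropy" concrete, the sketch below computes the Shannon entropy (in nats) of the next-token distribution from raw logits and averages it over generated tokens. It is a framework-free illustration in plain Python; the function names are hypothetical, and the study may measure entropy differently (for example over sampled answers rather than tokens).

```python
import math
from typing import Sequence


def token_entropy(logits: Sequence[float]) -> float:
    """Entropy in nats of softmax(logits), computed stably."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # shift by max to avoid overflow
    z = sum(exps)
    log_z = math.log(z)
    # H = -sum_i p_i * log p_i, with log p_i = (x_i - m) - log z
    return -sum((e / z) * ((x - m) - log_z) for e, x in zip(exps, logits))


def mean_policy_entropy(step_logits: Sequence[Sequence[float]]) -> float:
    """Average per-token entropy over all generated positions."""
    if not step_logits:
        raise ValueError("no token positions given")
    return sum(token_entropy(l) for l in step_logits) / len(step_logits)
```

Logging this value at the end of Stage-1 for each warm-start gives a direct view of the entropy regime the policy carries into RL, independent of its accuracy at that point.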

Topics

Vision-Language Models
Post-training
Supervised Fine-tuning
On-Policy Distillation
Reinforcement Learning
Policy Entropy
Qwen2.5-VL-7B

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.