SFT Overtraining Predicts Rank Inversion via Entropy Collapse Under RLVR
Summary
A study of supervised fine-tuning (SFT) overtraining finds that selecting SFT checkpoints on high pass@1 alone can lead to "rank inversion" under Group Relative Policy Optimization (GRPO): the pre-RL ranking of checkpoints reverses after RL. The proposed mechanism is entropy collapse, in which deeper SFT compresses the rollout distribution. In Qwen2.5-Coder-3B, increasing SFT depth raised pre-RL pass@1 but lowered peak GRPO pass@10 from 0.806 to 0.481 (3-seed mean, n=20), and pre-RL entropy correlated positively with GRPO outcomes (ρ=+0.69). DeepSeek-Coder-6.7B, by contrast, showed no rank inversion: its GRPO outcomes compressed but kept their order, as its pass@1 remained above p*(8)=0.083. The study proposes a two-stage diagnostic, combining pre-RL entropy triage with an early GRPO entropy monitor, to flag high-risk checkpoints and stop failing runs. Standard regularization methods such as a KL penalty to the reference model and label smoothing did not resolve the collapse of the Qwen checkpoints.
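The summary quotes the threshold p*(8)=0.083 without defining it. One reading that reproduces the number, offered here as an assumption rather than the study's stated definition, is the pass@1 at which a group of 8 independent rollouts contains at least one correct sample half the time. Under that reading, groups below the threshold are mostly all-wrong and give GRPO little reward signal. The function names below are hypothetical.

```python
def prob_group_has_success(pass_at_1: float, group_size: int) -> float:
    """Probability that at least one of `group_size` independent rollouts is correct."""
    return 1.0 - (1.0 - pass_at_1) ** group_size


def p_star(group_size: int, target: float = 0.5) -> float:
    """pass@1 at which prob_group_has_success equals `target`.

    ASSUMPTION: this definition is inferred from the quoted value 0.083,
    not taken from the study itself.
    """
    return 1.0 - (1.0 - target) ** (1.0 / group_size)


threshold = p_star(8)  # approximately 0.083
```

The independence assumption is a simplification; rollouts for the same prompt are correlated in practice, so the figure is best read as a rough guide.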
Key takeaway
Machine learning engineers choosing SFT checkpoints for GRPO should not rely on pre-RL pass@1 alone. A two-stage diagnostic, pre-RL entropy triage followed by an early GRPO entropy monitor, can detect rank inversion caused by SFT overtraining and entropy collapse. It helps identify and stop failing runs early, avoiding the significant performance degradation observed in models like Qwen2.5-Coder-3B.
Key insights
SFT overtraining can cause rank inversion in GRPO via entropy collapse, despite high pre-RL pass@1.
Principles
- Pre-RL entropy positively correlates with GRPO outcomes.
- High SFT depth can lead to rollout distribution compression.
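Both principles depend on measuring how spread out a checkpoint's rollout distribution is. A minimal sketch, assuming "entropy" means the mean per-token Shannon entropy of the policy's next-token distribution over sampled rollouts (the summary does not specify the exact estimator), and assuming per-token logits are available as plain lists:

```python
import math
from typing import Sequence


def token_entropy(logits: Sequence[float]) -> float:
    """Shannon entropy in nats of softmax(logits), computed stably."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract max to avoid overflow
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_rollout_entropy(rollouts: Sequence[Sequence[Sequence[float]]]) -> float:
    """Average token entropy over all tokens of all rollouts.

    `rollouts[i][t]` is the logit vector at token position t of rollout i
    (a hypothetical layout chosen for this sketch).
    """
    entropies = [token_entropy(step) for rollout in rollouts for step in rollout]
    if not entropies:
        raise ValueError("no tokens to average over")
    return sum(entropies) / len(entropies)
```

A checkpoint whose rollouts are nearly deterministic scores close to zero, while a uniform distribution over V tokens scores log(V).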
Method
A two-stage diagnostic combines pre-RL entropy triage with an early GRPO entropy monitor to identify and stop high-risk SFT checkpoints.
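The first stage, pre-RL entropy triage, could look roughly like the sketch below. The `Checkpoint` type, the `triage` function, and the `entropy_floor` threshold are all hypothetical; the summary gives no threshold value, so one would need to be calibrated per model family.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Checkpoint:
    name: str
    pass_at_1: float       # pre-RL pass@1
    pre_rl_entropy: float  # mean rollout entropy measured before GRPO


def triage(
    checkpoints: Sequence[Checkpoint], entropy_floor: float
) -> Tuple[List[Checkpoint], List[Checkpoint]]:
    """Split checkpoints into (candidates, high_risk).

    Checkpoints below `entropy_floor` are flagged as high risk regardless of
    pass@1; the remaining candidates are ordered by pass@1, best first.
    """
    high_risk = [c for c in checkpoints if c.pre_rl_entropy < entropy_floor]
    candidates = [c for c in checkpoints if c.pre_rl_entropy >= entropy_floor]
    candidates.sort(key=lambda c: c.pass_at_1, reverse=True)
    return candidates, high_risk


# Illustrative values only, not results from the study.
sweep = [
    Checkpoint("sft-shallow", pass_at_1=0.30, pre_rl_entropy=0.9),
    Checkpoint("sft-medium", pass_at_1=0.45, pre_rl_entropy=0.5),
    Checkpoint("sft-deep", pass_at_1=0.55, pre_rl_entropy=0.1),
]
candidates, high_risk = triage(sweep, entropy_floor=0.3)
```

Filtering on entropy before ranking on pass@1 keeps the highest-pass@1 checkpoint from being picked automatically when its rollout distribution has already collapsed.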
In practice
- Measure rollout entropy for each SFT checkpoint before starting GRPO.
- Track policy entropy during the early GRPO steps and stop runs where it collapses.
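An early entropy monitor for a GRPO run might be sketched as follows. The class name, the stopping rule (entropy below a fraction of its first observed value for several consecutive checks inside an early window), and every default value are assumptions made for illustration; the summary does not describe the study's actual stopping criterion.

```python
from typing import Optional


class EarlyEntropyMonitor:
    """Flags a run whose policy entropy collapses early in training."""

    def __init__(self, window_steps: int = 50, min_ratio: float = 0.5, patience: int = 3) -> None:
        self.window_steps = window_steps  # only steps up to this index are checked
        self.min_ratio = min_ratio        # collapse = entropy < min_ratio * baseline
        self.patience = patience          # consecutive low readings required
        self.baseline: Optional[float] = None
        self.low_streak = 0

    def update(self, step: int, entropy: float) -> bool:
        """Record one reading; return True if the run should be stopped."""
        if self.baseline is None:
            self.baseline = entropy  # first reading serves as the reference
            return False
        if step > self.window_steps:
            return False  # outside the early window, never trigger
        if entropy < self.min_ratio * self.baseline:
            self.low_streak += 1
        else:
            self.low_streak = 0
        return self.low_streak >= self.patience


def first_stop_step(entropies, **kwargs):
    """Return the first step at which the monitor fires, or None."""
    monitor = EarlyEntropyMonitor(**kwargs)
    for step, value in enumerate(entropies):
        if monitor.update(step, value):
            return step
    return None
```

Requiring several consecutive low readings makes the monitor less sensitive to a single noisy batch, at the cost of stopping a few steps later.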
Topics
- SFT Overtraining
- Rank Inversion
- Entropy Collapse
- GRPO Optimization
- Qwen2.5-Coder-3B
- DeepSeek-Coder-6.7B
- RLVR Diagnostics
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.