SFT Overtraining Predicts Rank Inversion via Entropy Collapse Under RLVR
Summary
Supervised fine-tuning (SFT) overtraining can cause "rank inversion" in reinforcement learning with verifiable rewards (RLVR) pipelines: SFT checkpoints with higher pre-RL pass@1 end up with markedly worse post-RL performance. In Qwen2.5-Coder-3B, peak GRPO pass@10 falls from 0.806 to 0.481. The mechanism combines "entropy collapse" and "reward variance collapse": the policy's output diversity shrinks, which extinguishes the gradient signal GRPO requires. DeepSeek-Coder-6.7B, by contrast, shows "rank compression" without inversion, keeping pass@1 above the critical threshold p*(8) = 0.083. A two-stage diagnostic flags high-risk checkpoints by combining pre-RL entropy triage (H(π_SFT) < τ_H = 0.18 nats) with an early GRPO entropy monitor (relative drop > τ_2 = 0.50 by step 150). Simple interventions such as KL regularization and label smoothing failed to rescue collapsed checkpoints.
Key takeaway
For machine learning engineers selecting SFT checkpoints for RLVR pipelines, relying solely on the highest pre-RL pass@1 can be detrimental: overtrained checkpoints may undergo "entropy collapse" and "rank inversion," ending with significantly degraded post-RL performance. Use the proposed two-stage diagnostic, pre-RL entropy triage followed by early GRPO entropy monitoring, to identify and avoid overtrained SFT checkpoints. This can prevent wasted compute and improve the final quality of your code generation models.
Key insights
SFT overtraining causes entropy collapse, which leads to rank inversion and failure under GRPO; at-risk checkpoints can be identified from their pre-RL entropy.
Principles
- Highest SFT pass@1 can be a misleading predictor for RLVR outcomes.
- Entropy collapse in SFT policies extinguishes GRPO's gradient signal.
- Reward variance E[σ_G^2] = p(1-p)(g-1)/g collapses at low pass@1.
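The reward variance principle above can be checked numerically. The sketch below assumes binary pass/fail rewards, reads p as the per-sample pass probability (pass@1) and g as the GRPO group size, and uses the 1/g (population) divisor for within-group variance; these readings are assumptions, since the summary does not define the symbols. The function names are illustrative only.

```python
import math


def expected_reward_variance(p: float, g: int) -> float:
    """Closed form quoted in the text: p(1-p)(g-1)/g."""
    return p * (1.0 - p) * (g - 1) / g


def enumerated_reward_variance(p: float, g: int) -> float:
    """Same quantity, summed over k = number of passing samples in a group."""
    total = 0.0
    for k in range(g + 1):
        prob_k = math.comb(g, k) * p**k * (1.0 - p) ** (g - k)
        mean_reward = k / g
        # Variance of k ones and g-k zeros with the 1/g divisor (assumption).
        total += prob_k * mean_reward * (1.0 - mean_reward)
    return total


if __name__ == "__main__":
    for p in (0.5, 0.2, 0.083, 0.01):
        closed = expected_reward_variance(p, 8)
        exact = enumerated_reward_variance(p, 8)
        print(f"p={p}: closed form={closed:.6f}, enumeration={exact:.6f}")
```

Under these assumptions the closed form and the enumeration agree, and the value shrinks toward zero as p falls, which is consistent with the claim that the signal collapses at low pass@1.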
Method
A two-stage diagnostic flags high-risk SFT checkpoints: (1) pre-RL triage on mean next-token entropy (H(π_SFT) < τ_H = 0.18 nats); (2) an early GRPO entropy monitor (relative drop > τ_2 = 0.50 by step 150).
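The two stages can be sketched as a pair of threshold checks. The thresholds come from the summary; everything else is an assumption made for illustration: the function names are hypothetical, the entropy trace is taken to be indexed by GRPO step with index 0 as the starting value, and "by step 150" is read as the largest relative drop seen at any step up to and including 150.

```python
from typing import Sequence

TAU_H = 0.18        # nats, stage 1 threshold quoted in the text
TAU_2 = 0.50        # relative entropy drop, stage 2 threshold quoted in the text
MONITOR_STEP = 150  # stage 2 horizon quoted in the text


def stage1_flag(mean_entropy_nats: float) -> bool:
    """Pre-RL triage: flag the SFT checkpoint if its mean entropy is below TAU_H."""
    return mean_entropy_nats < TAU_H


def stage2_flag(entropy_by_step: Sequence[float], monitor_step: int = MONITOR_STEP) -> bool:
    """Early GRPO monitor: flag if entropy drops by more than TAU_2 (relative
    to its starting value) at any step up to monitor_step (assumed reading)."""
    if not entropy_by_step:
        raise ValueError("entropy_by_step must contain at least one value")
    h0 = entropy_by_step[0]
    if h0 <= 0.0:
        raise ValueError("starting entropy must be positive")
    window = entropy_by_step[: monitor_step + 1]
    relative_drop = (h0 - min(window)) / h0
    return relative_drop > TAU_2


if __name__ == "__main__":
    print(stage1_flag(0.10))              # below 0.18 nats, so flagged
    print(stage2_flag([0.40, 0.30, 0.15]))  # relative drop of 0.625, so flagged
```

Running stage 1 before any RL compute is spent and stage 2 during the first steps of GRPO matches the order described above; how the paper combines the two flags is not stated in this summary.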
In practice
- Implement pre-RL entropy triage to identify overtrained SFT checkpoints.
- Use early GRPO entropy monitoring to halt failing training runs.
- Avoid relying solely on pre-RL pass@1 for SFT checkpoint selection.
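For the entropy triage step, one needs a mean next-token entropy in nats. The sketch below is a minimal pure-Python version that works on raw logits, one row per generated position. It assumes a uniform average over positions; the prompts, sampling temperature, and averaging scheme used in the paper are not given in this summary, and the function names are hypothetical.

```python
import math
from typing import Sequence


def token_entropy_nats(logits: Sequence[float]) -> float:
    """Shannon entropy (nats) of softmax(logits), computed stably."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    # H = log Z - sum_i p_i * (x_i - m), with p_i = exp(x_i - m) / Z
    return math.log(z) - sum(e * (x - m) for e, x in zip(exps, logits)) / z


def mean_next_token_entropy(logits_per_position: Sequence[Sequence[float]]) -> float:
    """Uniform average of per-position entropies (averaging scheme is an assumption)."""
    if not logits_per_position:
        raise ValueError("need at least one position")
    entropies = [token_entropy_nats(row) for row in logits_per_position]
    return sum(entropies) / len(entropies)


if __name__ == "__main__":
    # Toy 4-token vocabulary: one near-deterministic position, one uniform position.
    rows = [[50.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
    print(mean_next_token_entropy(rows))
```

In a real pipeline the logits would come from the SFT checkpoint's forward pass over sampled completions, and the resulting mean would be compared against τ_H = 0.18 nats.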
Topics
- Supervised Fine-Tuning
- Reinforcement Learning
- Entropy Collapse
- Code Generation Models
- Model Checkpoint Selection
- Gradient Vanishing
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.