SFT Overtraining Predicts Rank Inversion via Entropy Collapse Under RLVR

2026-06-18 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, extended

Summary

Supervised fine-tuning (SFT) overtraining can lead to "rank inversion" in reinforcement learning with verifiable rewards (RLVR) pipelines: SFT checkpoints with higher pre-RL pass@1 scores end up with significantly worse post-RL performance. The phenomenon is observed in Qwen2.5-Coder-3B, where it causes peak GRPO pass@10 to fall from 0.806 to 0.481. The mechanism involves "entropy collapse" and "reward variance collapse": the policy's output diversity diminishes, extinguishing the gradient signal GRPO requires. DeepSeek-Coder-6.7B, by contrast, exhibits "rank compression" without inversion, with pass@1 staying above the critical threshold p*(8) = 0.083. A two-stage diagnostic, combining pre-RL entropy triage (H(π_SFT) < τ_H = 0.18 nats) and an early GRPO entropy monitor (relative drop > τ_2 = 0.50 by step 150), effectively flags high-risk checkpoints. Simple interventions such as KL regularization and label smoothing failed to rescue collapsed checkpoints.

Key takeaway

Machine learning engineers selecting supervised fine-tuning (SFT) checkpoints for reinforcement learning with verifiable rewards (RLVR) pipelines should not rely solely on the highest pre-RL pass@1. Overtrained checkpoints can undergo "entropy collapse" and "rank inversion," leading to significantly degraded post-RL performance. The proposed two-stage diagnostic, pre-RL entropy triage followed by early GRPO entropy monitoring, identifies overtrained SFT checkpoints so they can be avoided. This can prevent wasted compute and improve the final quality of code generation models.

Key insights

SFT overtraining causes entropy collapse, which leads to rank inversion and failure under GRPO; at-risk checkpoints are identifiable from their pre-RL entropy.

Principles

Highest SFT pass@1 can be a misleading predictor for RLVR outcomes.
Entropy collapse in SFT policies extinguishes GRPO's gradient signal.
Reward variance E[σ_G^2] = p(1 - p)(G - 1)/G collapses at low pass@1 (p is pass@1, G the group size).
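
The reward-variance principle above can be checked numerically. The sketch below is illustrative only and is not taken from the paper's repository. It assumes binary pass/fail rewards, a group of G independent samples per prompt, and variance normalised by G (which is what produces the (G - 1)/G factor). The default G = 8 is an assumption read off the p*(8) notation, and both function names are hypothetical.

```python
def expected_group_variance(p: float, group_size: int) -> float:
    """Expected within-group variance of binary rewards: p(1-p)(G-1)/G."""
    return p * (1.0 - p) * (group_size - 1) / group_size


def prob_degenerate_group(p: float, group_size: int) -> float:
    """Probability that all G samples share one reward (all fail or all pass).

    In such a group every reward equals the group mean, so group-relative
    advantages are zero and the group contributes no policy gradient.
    """
    return (1.0 - p) ** group_size + p ** group_size


if __name__ == "__main__":
    G = 8  # assumed group size, inferred from the p*(8) notation
    for p in (0.5, 0.2, 0.083, 0.02):
        print(
            f"pass@1={p:.3f}  "
            f"E[var]={expected_group_variance(p, G):.4f}  "
            f"P(no signal)={prob_degenerate_group(p, G):.3f}"
        )
```

Under these assumptions, at pass@1 = 0.083 and G = 8 about half of all groups contain no passing sample, so roughly every second prompt yields no learning signal.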

Method

A two-stage diagnostic flags high-risk SFT checkpoints: (1) pre-RL mean next-token entropy triage (H(π_SFT) < τ_H = 0.18 nats); (2) early GRPO entropy monitor (relative drop > τ_2 = 0.50 by step 150).
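
Stage (1) comes down to averaging the Shannon entropy of the next-token distribution over sampled positions and comparing it with the threshold. The following is a minimal sketch under stated assumptions, not the authors' implementation: logits are supplied as plain Python lists, one list per token position, and how positions are sampled (which prompts, what temperature) is left open because the summary does not specify it. The function names are hypothetical; only the 0.18-nat threshold comes from the text.

```python
import math
from typing import Sequence

TAU_H = 0.18  # nats; pre-RL triage threshold quoted in the summary


def token_entropy(logits: Sequence[float]) -> float:
    """Shannon entropy in nats of softmax(logits), using a stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(q * math.log(q) for q in probs if q > 0.0)


def mean_next_token_entropy(logits_per_position: Sequence[Sequence[float]]) -> float:
    """Average token entropy over all supplied positions."""
    if not logits_per_position:
        raise ValueError("need at least one position")
    total = sum(token_entropy(row) for row in logits_per_position)
    return total / len(logits_per_position)


def is_high_risk(mean_entropy: float, tau_h: float = TAU_H) -> bool:
    """Flag a checkpoint whose mean entropy falls below the threshold."""
    return mean_entropy < tau_h


if __name__ == "__main__":
    peaked = [[10.0, 0.0, 0.0, 0.0]] * 3   # near-deterministic policy
    diffuse = [[0.0, 0.0, 0.0, 0.0]] * 3   # uniform over four tokens
    for name, rows in (("peaked", peaked), ("diffuse", diffuse)):
        h = mean_next_token_entropy(rows)
        print(f"{name}: H={h:.4f} nats, high_risk={is_high_risk(h)}")
```

With a real model the per-position logits would come from a forward pass over held-out prompts; the toy four-token vocabulary here only illustrates the arithmetic.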

In practice

Implement pre-RL entropy triage to identify overtrained SFT checkpoints.
Use early GRPO entropy monitoring to halt failing training runs.
Avoid relying solely on pre-RL pass@1 for SFT checkpoint selection.
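
The early-halt check in the list above can be sketched as a small stateful monitor fed from a training loop. This is an illustration under assumptions the summary does not confirm: the baseline is taken as the first logged entropy value, "by step 150" is read as "at any logged step up to and including 150", and the flag latches once raised. The class name `EntropyMonitor` and its fields are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EntropyMonitor:
    tau_2: float = 0.50        # relative-drop threshold quoted in the summary
    check_step: int = 150      # monitoring window quoted in the summary
    baseline: Optional[float] = None
    flagged: bool = False

    def update(self, step: int, entropy: float) -> bool:
        """Record one entropy reading; return True once the run is flagged."""
        if self.baseline is None:
            # Assumption: the first reading serves as the reference entropy.
            self.baseline = entropy
        elif step <= self.check_step and self.baseline > 0.0:
            relative_drop = (self.baseline - entropy) / self.baseline
            if relative_drop > self.tau_2:
                self.flagged = True
        return self.flagged


if __name__ == "__main__":
    monitor = EntropyMonitor()
    readings = [(0, 0.40), (50, 0.33), (100, 0.25), (150, 0.15)]
    for step, entropy in readings:
        if monitor.update(step, entropy):
            print(f"step {step}: entropy collapse flagged, halting run")
            break
```

Latching the flag is a design choice for this sketch: it keeps a later partial recovery in entropy from silently clearing an early warning.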

Topics

Supervised Fine-Tuning
Reinforcement Learning
Entropy Collapse
Code Generation Models
Model Checkpoint Selection
Gradient Vanishing

Code references

siddharthaphale/entropy-collapse-rlvr

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.