SFT Overtraining Predicts Rank Inversion via Entropy Collapse Under RLVR

Source: cs.CL updates on arXiv.org · Field: Technology & Digital, Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, extended

Summary

Overtraining during supervised fine-tuning (SFT) can cause "rank inversion" in reinforcement learning with verifiable rewards (RLVR) pipelines: SFT checkpoints with higher pre-RL pass@1 end up with substantially worse post-RL performance. In Qwen2.5-Coder-3B, peak pass@10 under GRPO (Group Relative Policy Optimization) falls from 0.806 to 0.481. The mechanism involves "entropy collapse" and "reward variance collapse": as the policy's output diversity shrinks, the samples drawn for a prompt receive near-identical rewards, and the gradient signal GRPO requires is extinguished. DeepSeek-Coder-6.7B, by contrast, exhibits "rank compression" without inversion, with pass@1 staying above the critical threshold p*(8)=0.083. A two-stage diagnostic flags high-risk checkpoints by combining pre-RL entropy triage (H(πSFT)<τH=0.18 nats) with an early GRPO entropy monitor (relative drop >τ2=0.50 by step 150). Simple interventions such as KL regularization and label smoothing failed to rescue collapsed checkpoints.
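To make the link between reward variance and gradient signal concrete, the sketch below uses binary pass/fail rewards and the group-relative normalization that GRPO-style methods apply: when every sample in a group gets the same reward, every advantage is zero. It assumes independent samples with per-sample success probability p (the pass@1) and a group size of 8. Reading p*(8) as the pass@1 at which a group of 8 has an even chance of containing at least one passing sample is an assumption of this example, not something the summary states; under that reading the threshold is 1 - 2^(-1/8), which rounds to the quoted 0.083. All function names are hypothetical.

```python
import math


def group_advantages(rewards, eps=1e-8):
    """Group-relative advantages: (r - mean) / (std + eps) within one group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def prob_mixed_group(p, group_size):
    """Probability that i.i.d. binary rewards are neither all-pass nor all-fail."""
    return 1.0 - p ** group_size - (1.0 - p) ** group_size


def half_success_threshold(group_size):
    """pass@1 at which a group has a 50% chance of at least one passing sample.

    Assumed interpretation of p*(G); the summary does not define it.
    """
    return 1.0 - 2.0 ** (-1.0 / group_size)


if __name__ == "__main__":
    # Identical rewards give zero advantages, so the group carries no signal.
    print(group_advantages([0, 0, 0, 0, 0, 0, 0, 0]))
    # A single success is enough to produce non-zero advantages.
    print(group_advantages([1, 0, 0, 0, 0, 0, 0, 0]))
    print(round(half_success_threshold(8), 3))
    for p in (0.01, 0.083, 0.3):
        print(p, round(prob_mixed_group(p, 8), 3))
```

The same effect appears at the other extreme: a policy that almost always passes also produces mostly uniform groups, which is why the sketch tracks mixed groups rather than successes alone.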

Key takeaway

If you are a machine learning engineer selecting supervised fine-tuning (SFT) checkpoints for reinforcement learning with verifiable rewards (RLVR) pipelines, do not rely solely on the highest pre-RL pass@1. Overtrained checkpoints can suffer "entropy collapse" and "rank inversion," leaving post-RL performance well below that of checkpoints that scored lower before RL. Use the proposed two-stage diagnostic, pre-RL entropy triage followed by early GRPO entropy monitoring, to identify and avoid overtrained SFT checkpoints. Doing so can prevent wasted compute and improve the final quality of your code generation models.

Key insights

SFT overtraining causes entropy collapse, which leads to rank inversion and GRPO failure; at-risk checkpoints can be identified from their pre-RL entropy.
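The entropy in question is measured in nats, that is, with the natural logarithm. As a minimal sketch, assuming the per-position next-token probability distributions have already been extracted from the SFT policy, the mean next-token entropy can be computed as below. The function names and the toy distributions are hypothetical, and this summary does not say which prompts or positions the paper averages over.

```python
import math


def token_entropy(probs):
    """Shannon entropy in nats of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_next_token_entropy(distributions):
    """Average entropy over a list of next-token distributions."""
    if not distributions:
        raise ValueError("need at least one distribution")
    return sum(token_entropy(d) for d in distributions) / len(distributions)


if __name__ == "__main__":
    peaked = [0.97, 0.01, 0.01, 0.01]  # toy, near-deterministic position
    spread = [0.25, 0.25, 0.25, 0.25]  # toy, uniform over 4 tokens
    print(round(token_entropy(peaked), 3))
    print(round(token_entropy(spread), 3))  # ln(4), about 1.386
    print(round(mean_next_token_entropy([peaked, spread]), 3))
```

In a real pipeline the distributions would come from a softmax over the model's logits; the pure-Python form here is only meant to pin down the quantity being thresholded.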

Method

A two-stage diagnostic flags high-risk SFT checkpoints: (1) pre-RL triage on mean next-token entropy (flag if H(πSFT)<τH=0.18 nats); (2) an early GRPO entropy monitor (flag if the relative entropy drop exceeds τ2=0.50 by step 150).
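The two stages can be sketched as a simple gate over logged entropy values. The thresholds are the ones quoted above; everything else is an assumption made for illustration: the names are hypothetical, the relative drop is taken as (H_0 - min H_t) / H_0 over steps up to 150, and a checkpoint is treated as high risk if either stage fires. The paper may define the drop or combine the stages differently.

```python
from dataclasses import dataclass

TAU_H = 0.18        # stage 1: pre-RL entropy threshold in nats (quoted above)
TAU_2 = 0.50        # stage 2: relative entropy drop threshold (quoted above)
MONITOR_STEP = 150  # stage 2: GRPO step by which the drop is checked


@dataclass
class Diagnosis:
    stage1_flag: bool
    stage2_flag: bool

    @property
    def high_risk(self):
        # Assumption: either flag alone marks the checkpoint as high risk.
        return self.stage1_flag or self.stage2_flag


def relative_entropy_drop(entropy_by_step, until_step=MONITOR_STEP):
    """Relative drop from step-0 entropy to the lowest value up to until_step."""
    h0 = entropy_by_step[0]
    if h0 <= 0.0:
        raise ValueError("initial entropy must be positive")
    lowest = min(h for step, h in entropy_by_step.items() if step <= until_step)
    return (h0 - lowest) / h0


def diagnose(pre_rl_entropy, entropy_by_step):
    """Apply both stages to one checkpoint's entropy measurements."""
    stage1 = pre_rl_entropy < TAU_H
    stage2 = relative_entropy_drop(entropy_by_step) > TAU_2
    return Diagnosis(stage1, stage2)


if __name__ == "__main__":
    # Toy entropy logs (step -> mean entropy in nats); not results from the paper.
    collapsed = diagnose(0.12, {0: 0.12, 50: 0.09, 150: 0.05})
    healthy = diagnose(0.40, {0: 0.40, 50: 0.36, 150: 0.31})
    print(collapsed, collapsed.high_risk)
    print(healthy, healthy.high_risk)
```

Stage 1 costs only a forward pass over held-out prompts, so it can screen checkpoints before any RL compute is spent; stage 2 needs a short GRPO run but can stop at step 150.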

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.