SFT Overtraining Predicts Rank Inversion via Entropy Collapse Under RLVR

2026-06-16 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, quick

Summary

A study of supervised fine-tuning (SFT) overtraining finds that selecting SFT checkpoints on pre-RL pass@1 alone can produce "rank inversion" under Group Relative Policy Optimization (GRPO): the ordering of checkpoints by pre-RL pass@1 reverses after RL. The inversion appears when SFT compresses the rollout distribution, an effect the study calls "entropy collapse." In Qwen2.5-Coder-3B, deeper SFT raised pre-RL pass@1 but cut peak GRPO pass@10 from 0.806 to 0.481 (3-seed mean, n=20), and pre-RL entropy correlated positively with GRPO outcomes (ρ=+0.69). DeepSeek-Coder-6.7B showed no rank inversion: its GRPO outcomes compressed toward one another but kept their order, as its pass@1 remained above p*(8)=0.083. The study proposes a two-stage diagnostic, pre-RL entropy triage followed by an early GRPO entropy monitor, to flag high-risk checkpoints and stop failing runs early. Standard regularization (a KL penalty to the reference policy, label smoothing) did not resolve the collapse in the Qwen checkpoints.
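The summary quotes p*(8)=0.083 without defining it. One reading that reproduces the number, offered here as an assumption and not as the study's stated definition, is the pass@1 at which a group of 8 independent rollouts contains at least one success with probability 0.5. Below that level most groups would be all failures, and a group whose rewards are all identical gives GRPO's group-relative advantage nothing to work with. The function names and the 0.5 target below are hypothetical.

```python
def group_success_threshold(group_size: int, target: float = 0.5) -> float:
    """Pass@1 at which P(at least one success in the group) equals `target`.

    Assumed reading of p*(G); rollouts are treated as independent
    Bernoulli trials with success probability equal to pass@1.
    """
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    if not 0.0 < target < 1.0:
        raise ValueError("target must lie strictly between 0 and 1")
    # Solve 1 - (1 - p)^G = target for p.
    return 1.0 - (1.0 - target) ** (1.0 / group_size)


def prob_all_fail(pass_at_1: float, group_size: int) -> float:
    """Probability that every rollout in a group fails (no reward contrast)."""
    return (1.0 - pass_at_1) ** group_size


if __name__ == "__main__":
    p_star = group_success_threshold(8)
    print(f"p*(8) under this reading: {p_star:.3f}")
    print(f"P(all 8 rollouts fail) at p*: {prob_all_fail(p_star, 8):.2f}")
```

Under this reading, a checkpoint whose pass@1 stays above the threshold still produces mixed-outcome groups more often than not, which would be consistent with the DeepSeek result, though the summary does not confirm the mechanism.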

Key takeaway

If you are selecting SFT checkpoints for GRPO, do not rely on pre-RL pass@1 alone. Implement a two-stage diagnostic, pre-RL entropy triage followed by an early GRPO entropy monitor, to detect likely rank inversion caused by SFT overtraining and entropy collapse. This lets you identify and stop failing runs early, avoiding the kind of performance loss seen in Qwen2.5-Coder-3B.

Key insights

SFT overtraining can cause rank inversion in GRPO via entropy collapse, despite high pre-RL pass@1.

Principles

Pre-RL entropy positively correlates with GRPO outcomes.
High SFT depth can lead to rollout distribution compression.
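The summary does not say how entropy was measured. A common choice, assumed here, is the mean Shannon entropy of the next-token distribution across sampled rollouts. The sketch below computes it from raw logits using only the standard library; the function names and the nested-list input layout (rollouts, then token positions, then vocabulary logits) are illustrative assumptions.

```python
import math
from typing import Sequence


def softmax(logits: Sequence[float]) -> list[float]:
    """Numerically stable softmax over one vocabulary-sized logit vector."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_entropy(logits: Sequence[float]) -> float:
    """Shannon entropy (in nats) of the next-token distribution."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0.0)


def mean_rollout_entropy(rollouts: Sequence[Sequence[Sequence[float]]]) -> float:
    """Average token entropy over every position of every rollout."""
    entropies = [token_entropy(step) for rollout in rollouts for step in rollout]
    if not entropies:
        raise ValueError("no token positions to average over")
    return sum(entropies) / len(entropies)
```

A sharply peaked distribution yields entropy near zero, so a checkpoint whose rollouts are dominated by such positions would score low on this measure, which is the compression the principles above describe.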

Method

A two-stage diagnostic combines pre-RL entropy triage with an early GRPO entropy monitor to identify and stop high-risk SFT checkpoints.
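The first stage can be sketched as a filter applied before any GRPO compute is spent. The code below is a minimal illustration, not the study's procedure: the `Checkpoint` record, the `triage` function, and the `min_entropy` cutoff are hypothetical, and the summary gives no threshold value, so the cutoff must be calibrated per model family.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Checkpoint:
    name: str
    pass_at_1: float        # pre-RL pass@1 (recorded, but not used to rank)
    pre_rl_entropy: float   # pre-RL rollout entropy, however it was measured


def triage(
    checkpoints: Iterable[Checkpoint], min_entropy: float
) -> tuple[list[Checkpoint], list[Checkpoint]]:
    """Split checkpoints into (proceed, high_risk) by pre-RL entropy.

    Both lists are ordered by entropy, highest first, so pass@1 never
    drives the ordering. `min_entropy` is a hypothetical cutoff.
    """
    ordered = sorted(checkpoints, key=lambda c: c.pre_rl_entropy, reverse=True)
    proceed = [c for c in ordered if c.pre_rl_entropy >= min_entropy]
    high_risk = [c for c in ordered if c.pre_rl_entropy < min_entropy]
    return proceed, high_risk
```

Flagged checkpoints need not be discarded outright; launching them with an early entropy monitor attached is one way to confirm or clear the flag cheaply.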

In practice

Measure rollout entropy for each SFT checkpoint before starting GRPO.
Track entropy during the early GRPO steps and stop runs that collapse.
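An early monitor during GRPO can be as simple as comparing each logged entropy value with the pre-RL baseline. The sketch below is one possible shape under stated assumptions: the class name, the `collapse_ratio` of 0.5, and the 50-step watch window are all hypothetical defaults, since the summary does not specify the stopping rule.

```python
class EarlyEntropyMonitor:
    """Flag a GRPO run whose policy entropy collapses in the first steps.

    All thresholds are illustrative assumptions, not values from the study.
    """

    def __init__(
        self,
        baseline_entropy: float,
        collapse_ratio: float = 0.5,
        watch_steps: int = 50,
    ) -> None:
        if baseline_entropy <= 0.0:
            raise ValueError("baseline_entropy must be positive")
        self.baseline_entropy = baseline_entropy
        self.collapse_ratio = collapse_ratio
        self.watch_steps = watch_steps

    def should_stop(self, step: int, entropy: float) -> bool:
        """Return True if entropy fell below the cutoff inside the watch window."""
        if step > self.watch_steps:
            # Outside the early window this monitor makes no judgment.
            return False
        return entropy < self.collapse_ratio * self.baseline_entropy
```

In a training loop, `should_stop(step, entropy)` would be called once per logging step with the current mean policy entropy, and the run aborted on the first `True`. Limiting the check to an early window is a deliberate choice here: a gradual entropy decline later in training is expected as the policy improves and should not by itself trigger a stop.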

Topics

SFT Overtraining
Rank Inversion
Entropy Collapse
GRPO Optimization
Qwen2.5-Coder-3B
DeepSeek-Coder-6.7B
RLVR Diagnostics

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.