PEAR: Supervised Fine-Tuning Optimized for RL

Source: The Salt - Curated AI · Field: Technology & Digital, Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Advanced, short

Summary

Two recent studies examine large language model (LLM) training and compression. The first, "Good SFT Optimizes for SFT, Better SFT Prepares for Reinforcement Learning," introduces PEAR, a loss-reweighting scheme for supervised fine-tuning (SFT) that improves performance after reinforcement learning (RL). Experiments with Qwen and DeepSeek-distilled models on logic games and math benchmarks show that standard SFT objectives often yield a suboptimal initialization for RL, whereas PEAR consistently improves post-RL results.

The second study, "On the Limits of Layer Pruning for Generative Reasoning in LLMs," investigates layer pruning in 7B to 8B instruction models such as Llama 3.1 and Qwen 2.5. Classification accuracy remains robust under pruning, but generative reasoning tasks (math, code) degrade significantly even when few layers are removed. The authors propose fine-tuning pruned models on Self-Generated Responses (SGR), which retains more performance than fine-tuning on open-source data.

Key takeaway

AI engineers and research scientists developing or deploying LLMs should note that standard SFT may not prepare a model optimally for subsequent RL; reweighting schemes such as PEAR address this gap. When considering layer pruning for model compression, expect generative reasoning (math, code) to degrade sharply even at low pruning ratios. Fine-tuning on Self-Generated Responses (SGR) is a recovery strategy that retains more performance in pruned models, especially on generative tasks.
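
The SGR recovery step can be pictured as a data-collection loop: the original model answers each prompt, and the resulting pairs become the fine-tuning set for the pruned model. The sketch below is a minimal illustration under stated assumptions, not the authors' pipeline. `build_sgr_dataset` and the `generate` callable are hypothetical names; in practice `generate` would wrap whatever inference call the unpruned model exposes. Dropping empty responses is also an assumption made here for tidiness.

```python
from typing import Callable, Dict, Iterable, List


def build_sgr_dataset(
    prompts: Iterable[str],
    generate: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Collect (prompt, response) pairs produced by the unpruned model.

    `generate` is a hypothetical stand-in for the unpruned model's
    inference call; it maps a prompt string to a response string.
    """
    records: List[Dict[str, str]] = []
    for prompt in prompts:
        response = generate(prompt)
        # Assumption: skip empty or whitespace-only generations.
        if response.strip():
            records.append({"prompt": prompt, "response": response})
    return records


def toy_generate(prompt: str) -> str:
    """Toy generator used only to make the sketch runnable."""
    return f"[unpruned model answer to] {prompt}"


if __name__ == "__main__":
    dataset = build_sgr_dataset(["What is 12 * 7?"], toy_generate)
    print(dataset)
```

The pruned model would then be fine-tuned on `dataset` with an ordinary SFT loop, in place of an external open-source instruction set.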

Key insights

Optimizing SFT for RL requires aligning offline and online learning, while layer pruning severely impacts LLM generative reasoning.

Method

PEAR reweights the SFT loss with importance weights derived from token-level likelihood ratios, so that offline training stays aligned with the model's own policy and gives RL a better starting point. SGR fine-tunes a pruned model on responses generated by the unpruned base model.
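
To make the idea of token-level importance weighting concrete, here is a minimal sketch of a reweighted negative log-likelihood for a single sequence. It is not PEAR's published formulation. The direction of the ratio (current policy over data-generating policy), the clipping bounds, the averaging over tokens, and the function name `importance_weighted_sft_loss` are all assumptions chosen for illustration. In a real training loop the weights would be detached from the gradient.

```python
import math
from typing import List, Tuple


def importance_weighted_sft_loss(
    policy_logps: List[float],
    behavior_logps: List[float],
    clip_min: float = 0.1,
    clip_max: float = 10.0,
) -> Tuple[float, List[float]]:
    """Token-level importance-weighted NLL for one sequence (illustrative).

    policy_logps:   log-probs of the target tokens under the model being trained
    behavior_logps: log-probs of the same tokens under the model that
                    produced the SFT data (assumed to be available)
    """
    if len(policy_logps) != len(behavior_logps):
        raise ValueError("log-prob lists must have the same length")
    if not policy_logps:
        raise ValueError("need at least one token")

    weights: List[float] = []
    for lp, lb in zip(policy_logps, behavior_logps):
        ratio = math.exp(lp - lb)  # likelihood ratio: policy / behavior
        # Assumption: clip the ratio to keep the weights bounded.
        weights.append(min(max(ratio, clip_min), clip_max))

    # Weights act as constants here; only the log-prob term would carry gradient.
    loss = -sum(w * lp for w, lp in zip(weights, policy_logps)) / len(policy_logps)
    return loss, weights


if __name__ == "__main__":
    loss, weights = importance_weighted_sft_loss([-1.0, -2.0], [-1.0, -2.0])
    print(loss, weights)
```

When the two models assign identical log-probabilities, every weight is 1 and the loss reduces to the ordinary mean negative log-likelihood used in standard SFT.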

Best for: AI Engineer, NLP Engineer, Research Scientist, AI Researcher, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by The Salt - Curated AI.