Learning What to Learn: Stage-Specific Data Sets for SFT-then-RL in Small Language Model Reasoning
Summary
The paper proposes a difficulty-aware SFT-then-RL framework for post-training Small Language Models (SLMs) to improve their reasoning. Its core argument is that Supervised Fine-Tuning (SFT) should focus on reasoning skills the model has not yet mastered, while Reinforcement Learning (RL) should consolidate skills the model can already partially access. To that end, the framework organizes training data into stage-specific sets. For challenging SFT samples, a "Bridge mechanism" converts raw teacher-generated reasoning traces into more learnable supervision. When hard samples remain unsolved during RL, "Critique Fine-Tuning" turns these all-zero-reward failures into supervision for the next SFT stage, in the form of a diagnosis, a repair, and a new reasoning trace. In experiments on two SLMs across five reasoning benchmarks, the method consistently outperforms representative SFT, distillation, and RL baselines, which underscores the importance of matching data difficulty to each training stage.
Key takeaway
Machine Learning Engineers optimizing Small Language Models (SLMs) for reasoning should consider a stage-specific data strategy for SFT-then-RL pipelines. Use SFT data to introduce new, challenging skills and RL data to refine skills the model has already partially mastered. Add a mechanism like the "Bridge" to make hard teacher traces learnable during SFT, and one like "Critique Fine-Tuning" to turn unsolved RL samples into targeted supervision for the next SFT stage. In the reported experiments, this approach outperformed representative SFT, distillation, and RL baselines on two SLMs across five reasoning benchmarks.
Key insights
Data strategy for SFT-then-RL pipelines should align with each stage's distinct role in skill acquisition and consolidation.
Principles
- SFT acquires not-yet-mastered reasoning skills.
- RL consolidates partially accessible skills.
- Coordinate data difficulty across SFT and RL.
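One way to make these principles concrete is to route each training sample by how often the current model already solves it. The sketch below is an illustration only, not the paper's procedure: the `Sample` type, the `pass_rate` field, the `route_by_difficulty` function, and the thresholds 0.0 and 1.0 are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    prompt: str
    # Assumed difficulty signal: fraction of sampled answers judged correct, in [0, 1].
    pass_rate: float


def route_by_difficulty(samples, low=0.0, high=1.0):
    """Split samples into stage-specific buckets by pass rate.

    Assumed rule (hypothetical thresholds):
      pass_rate <= low        -> "sft"      (not yet mastered)
      low < pass_rate < high  -> "rl"       (partially accessible)
      pass_rate >= high       -> "mastered" (nothing left to learn)
    """
    buckets = {"sft": [], "rl": [], "mastered": []}
    for sample in samples:
        if sample.pass_rate <= low:
            buckets["sft"].append(sample)
        elif sample.pass_rate < high:
            buckets["rl"].append(sample)
        else:
            buckets["mastered"].append(sample)
    return buckets


samples = [Sample("p1", 0.0), Sample("p2", 0.25), Sample("p3", 1.0)]
buckets = route_by_difficulty(samples)
sft_prompts = [s.prompt for s in buckets["sft"]]
rl_prompts = [s.prompt for s in buckets["rl"]]
mastered_prompts = [s.prompt for s in buckets["mastered"]]
```

In this toy run, `p1` goes to the SFT set, `p2` to the RL set, and `p3` is set aside as already mastered. A real pipeline would need to choose the thresholds and the number of sampled answers empirically.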
Method
A difficulty-aware SFT-then-RL framework uses stage-specific data. It employs a "Bridge mechanism" for SFT's hard samples and "Critique Fine-Tuning" for RL's unsolved hard samples, feeding diagnostics back to SFT.
In practice
- Transform raw teacher traces for SFT.
- Convert RL failures into diagnostic SFT data.
- Tailor data difficulty to SFT and RL stages.
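The second item, converting RL failures into diagnostic SFT data, can be sketched as a filter over rollout groups. This is a minimal sketch under stated assumptions, not the paper's implementation: the group layout (`prompt`, `rollouts`, `rewards`), the `critic` callable, the choice of the first rollout as the one to critique, and the record fields are all hypothetical. In practice the critic would likely be a stronger teacher model; here it is a stub.

```python
def is_all_zero(rewards):
    """True when a non-empty group of rollouts earned no reward at all."""
    return len(rewards) > 0 and all(r == 0 for r in rewards)


def build_critique_records(groups, critic):
    """Turn all-zero-reward RL groups into SFT records.

    groups: list of dicts with keys "prompt", "rollouts", "rewards" (assumed layout).
    critic: hypothetical callable (prompt, failed_rollout) -> dict with
            keys "diagnosis", "repair", "new_trace".
    """
    records = []
    for group in groups:
        if not is_all_zero(group["rewards"]):
            continue  # the model can partially solve this prompt; leave it to RL
        failed = group["rollouts"][0]  # assumption: critique the first failed rollout
        critique = critic(group["prompt"], failed)
        records.append({
            "prompt": group["prompt"],
            "failed_attempt": failed,
            "diagnosis": critique["diagnosis"],
            "repair": critique["repair"],
            "new_trace": critique["new_trace"],
        })
    return records


def stub_critic(prompt, failed_rollout):
    # Placeholder standing in for a teacher model; returns fixed strings.
    return {
        "diagnosis": "placeholder diagnosis",
        "repair": "placeholder repair",
        "new_trace": "placeholder trace",
    }


groups = [
    {"prompt": "q1", "rollouts": ["a", "b"], "rewards": [0, 0]},
    {"prompt": "q2", "rollouts": ["c", "d"], "rewards": [0, 1]},
]
records = build_critique_records(groups, stub_critic)
```

Only `q1` produces a record, because `q2` has at least one rewarded rollout and so still carries a learning signal for RL.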
Topics
- Small Language Models
- Supervised Fine-Tuning
- Reinforcement Learning
- Reasoning Benchmarks
- Data Strategy
- Post-training Optimization
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.