Learning What to Learn: Stage-Specific Data Sets for SFT-then-RL in Small Language Model Reasoning

2026-06-03 · Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, medium

Summary

The paper proposes a difficulty-aware SFT-then-RL framework for post-training Small Language Models (SLMs) to improve their reasoning. The core argument is that Supervised Fine-Tuning (SFT) should focus on acquiring not-yet-mastered reasoning skills, while Reinforcement Learning (RL) should consolidate skills the model can already partially access. The framework therefore organizes training data into stage-specific sets. For challenging SFT samples, a "Bridge mechanism" converts raw teacher-generated reasoning traces into more learnable supervision. When hard samples remain unsolved during RL, "Critique Fine-Tuning" is applied: all-zero-reward failures are turned into supervision for the next SFT stage, in the form of a diagnosis, a repair, and a new reasoning trace. Experiments on two SLMs across five reasoning benchmarks consistently show the method outperforming representative SFT, distillation, and RL baselines, underscoring the importance of aligning data difficulty with each training stage.

Key takeaway

Machine Learning Engineers optimizing Small Language Models (SLMs) for reasoning should consider a stage-specific data strategy for SFT-then-RL pipelines. Use SFT data to introduce skills the model has not yet mastered, and RL data to refine skills it can already partially access. Mechanisms like the "Bridge" (for hard SFT samples) and "Critique Fine-Tuning" (for samples that stay unsolved during RL) make hard data more learnable and turn failures into targeted supervision. In the reported experiments, this approach outperformed representative SFT, distillation, and RL baselines on five reasoning benchmarks.

Key insights

Data strategy for SFT-then-RL pipelines should align with each stage's distinct role in skill acquisition and consolidation.

Principles

SFT acquires not-yet-mastered reasoning skills.
RL consolidates partially accessible skills.
Coordinate data difficulty across SFT and RL.
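
The principles above can be sketched as a routing step. This is a minimal illustration, not the paper's procedure: it assumes difficulty is estimated by a pass rate over k sampled answers from the current model, and that a pass rate of zero means "not yet mastered" while anything strictly between zero and one means "partially accessible". The names Problem, pass_rate, route_by_difficulty, and sample_fn, as well as the thresholds, are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Problem:
    problem_id: str
    question: str
    answer: str


def pass_rate(problem: Problem, sample_fn: Callable[[str], str], k: int = 8) -> float:
    """Fraction of k sampled answers that exactly match the reference answer."""
    hits = sum(sample_fn(problem.question) == problem.answer for _ in range(k))
    return hits / k


def route_by_difficulty(
    problems: List[Problem], sample_fn: Callable[[str], str], k: int = 8
) -> Dict[str, List[Problem]]:
    """Split problems into stage-specific sets (assumed thresholds, see lead-in)."""
    buckets: Dict[str, List[Problem]] = {"sft": [], "rl": [], "mastered": []}
    for p in problems:
        rate = pass_rate(p, sample_fn, k)
        if rate == 0.0:
            buckets["sft"].append(p)       # never solved: needs new supervision
        elif rate < 1.0:
            buckets["rl"].append(p)        # sometimes solved: RL can consolidate
        else:
            buckets["mastered"].append(p)  # always solved: little left to learn
    return buckets
```

Here sample_fn stands in for whatever draws one answer from the model. Exact string match is the simplest possible verifier; a real pipeline would likely need answer normalization.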

Method

A difficulty-aware SFT-then-RL framework uses stage-specific data. It applies a "Bridge mechanism" to SFT's hard samples and "Critique Fine-Tuning" to hard samples that RL leaves unsolved, feeding the resulting diagnostic supervision back into the next SFT stage.

In practice

Transform raw teacher traces for SFT.
Convert RL failures into diagnostic SFT data.
Tailor data difficulty to SFT and RL stages.
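
The second item, converting RL failures into diagnostic SFT data, might look roughly like the sketch below. It assumes rollouts are grouped per prompt, that a separate critic (for example a teacher model) returns a diagnosis, a repair, and a new reasoning trace, and that the first failed rollout serves as the representative attempt. Rollout, Critique, failures_to_sft, the critic callable, and the record layout are all hypothetical, not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass
class Rollout:
    prompt: str
    response: str
    reward: float


@dataclass
class Critique:
    diagnosis: str  # what went wrong in the failed attempt
    repair: str     # how to fix it
    new_trace: str  # a corrected reasoning trace


def all_zero(group: Sequence[Rollout]) -> bool:
    """True when a non-empty group of rollouts earned no reward at all."""
    return len(group) > 0 and all(r.reward == 0.0 for r in group)


def failures_to_sft(
    groups: Sequence[Sequence[Rollout]],
    critic: Callable[[str, str], Critique],
) -> List[Dict[str, str]]:
    """Build next-stage SFT records from groups where every rollout failed."""
    records: List[Dict[str, str]] = []
    for group in groups:
        if not all_zero(group):
            continue  # RL still gets a learning signal from this prompt
        failed = group[0]  # assumption: first rollout represents the failure
        c = critic(failed.prompt, failed.response)
        records.append({
            "prompt": failed.prompt,
            "failed_attempt": failed.response,
            "target": (
                f"Diagnosis: {c.diagnosis}\n"
                f"Repair: {c.repair}\n"
                f"Reasoning: {c.new_trace}"
            ),
        })
    return records
```

Groups with at least one rewarded rollout are skipped on purpose: in this sketch only prompts that RL cannot solve at all are handed back to SFT.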

Topics

Small Language Models
Supervised Fine-Tuning
Reinforcement Learning
Reasoning Benchmarks
Data Strategy
Post-training Optimization

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.