RASFT: Rollout-Adaptive Supervised Fine-Tuning for Reasoning

2026-05-21 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, medium

Summary

Rollout-Adaptive Supervised Fine-Tuning (RASFT) is a novel policy-aware framework designed to enhance large language models' reasoning abilities by addressing the limitations of traditional Supervised Fine-Tuning (SFT). Standard SFT often overfits to single expert demonstrations, suppressing the model's inherent reasoning distribution. RASFT dynamically adjusts expert supervision based on problem-level solvability, estimated from verified on-policy rollouts. For challenging problems, RASFT intensifies expert guidance, while for problems where the model demonstrates reliable reasoning, it relaxes rigid imitation and integrates the model's own correct self-generated trajectories. The framework also incorporates a clipped inverse ratio to prevent excessive policy drift. Experiments across six mathematical and two code reasoning benchmarks show RASFT consistently outperforms SFT, its variants, and RL methods. It achieved a 10.9% relative gain on Qwen2.5-Math-1.5B math reasoning (25.00 to 27.72) and up to 26.9% on Llama-3.2-3B code generation (24.93 to 31.63).
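The relative gains quoted above follow directly from the absolute scores. A quick check, using only the four numbers in the summary (the helper name `relative_gain` is introduced here for illustration):

```python
def relative_gain(before: float, after: float) -> float:
    """Relative improvement of `after` over `before`, in percent."""
    return (after - before) / before * 100.0


# Scores as reported in the summary above.
math_gain = relative_gain(25.00, 27.72)  # Qwen2.5-Math-1.5B, math reasoning
code_gain = relative_gain(24.93, 31.63)  # Llama-3.2-3B, code generation

print(f"math: {math_gain:.1f}%  code: {code_gain:.1f}%")
```

Both figures are relative improvements over the baseline score, not absolute point differences; the absolute gains are 2.72 and 6.70 points respectively.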

Key takeaway

Machine learning engineers fine-tuning LLMs for complex reasoning tasks should consider a rollout-adaptive approach such as RASFT. Standard SFT can limit model performance by rigidly imitating a single expert path per problem. Adjusting supervision according to the model's on-policy performance, and training on its own verified correct solutions, can activate the LLM's intrinsic reasoning and improve results, as RASFT's reported gains on math and code benchmarks suggest.

Key insights

RASFT calibrates expert SFT guidance with on-policy rollouts to prevent overfitting and activate the LLM's intrinsic reasoning.

Principles

Reasoning benefits from adaptive, not rigid, expert imitation.
The policy's problem-level ability should calibrate supervision strength.
Preserve useful reasoning priors via policy drift constraints.

Method

RASFT samples multiple on-policy rollouts per problem, verifies them, and combines the correct ones with expert trajectories. It estimates each problem's solvability from the rollout success rate, then uses that estimate to weight expert guidance against self-generated paths, with a clipped inverse policy ratio constraining drift.
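The weighting step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's formulation: solvability is taken as the fraction of verified-correct rollouts, the expert weight as one minus that fraction, and the remainder is split evenly across the correct rollouts. The names `WeightedTrajectory`, `estimate_solvability`, `build_training_mix`, and the `min_expert_weight` floor are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class WeightedTrajectory:
    text: str
    weight: float
    source: str  # "expert" or "self"


def estimate_solvability(rollouts: List[str], verifier: Callable[[str], bool]) -> float:
    """Fraction of on-policy rollouts that the verifier accepts (assumed estimator)."""
    if not rollouts:
        return 0.0
    return sum(1 for r in rollouts if verifier(r)) / len(rollouts)


def build_training_mix(
    expert: str,
    rollouts: List[str],
    verifier: Callable[[str], bool],
    min_expert_weight: float = 0.1,  # hypothetical floor so the expert is never dropped
) -> List[WeightedTrajectory]:
    """Weight the expert trajectory against verified self-generated ones.

    Assumption: hard problems (low solvability) lean on the expert; easy
    problems shift weight to the model's own correct rollouts.
    """
    correct = [r for r in rollouts if verifier(r)]
    solvability = estimate_solvability(rollouts, verifier)
    # With no correct rollouts, all weight stays on the expert.
    expert_weight = max(1.0 - solvability, min_expert_weight) if correct else 1.0
    mix = [WeightedTrajectory(expert, expert_weight, "expert")]
    self_share = (1.0 - expert_weight) / len(correct) if correct else 0.0
    mix.extend(WeightedTrajectory(r, self_share, "self") for r in correct)
    return mix


# Toy usage: a verifier that accepts any rollout ending in the answer "42".
toy_verifier = lambda r: r.strip().endswith("42")
toy_rollouts = ["path A ... 42", "path B ... 7", "path C ... 42", "path D ... 42"]
toy_mix = build_training_mix("expert path ... 42", toy_rollouts, toy_verifier)
```

In the toy case three of four rollouts are correct, so the expert trajectory keeps a quarter of the weight and each correct rollout gets a quarter. The actual mapping from solvability to weights in RASFT may differ.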

In practice

Evaluate problem solvability using model-generated rollouts.
Dynamically weight expert and self-generated correct trajectories.
Employ a frozen reference model to limit policy divergence.
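The drift constraint in the last item can be illustrated with a per-token weight. The summary does not define the ratio, so this sketch assumes one plausible reading: the ratio is the current policy's token probability over a frozen reference model's, its inverse is clipped to `[1 - eps, 1 + eps]`, and the result scales the token's negative log-likelihood. The function names and the `eps` value are hypothetical.

```python
import math
from typing import Sequence


def clipped_inverse_ratio(logp_policy: float, logp_ref: float, eps: float = 0.2) -> float:
    """Clip pi_ref / pi_policy to [1 - eps, 1 + eps] (assumed form of the constraint)."""
    # exp(logp_ref - logp_policy) is the inverse of pi_policy / pi_ref.
    inverse = math.exp(logp_ref - logp_policy)
    return min(max(inverse, 1.0 - eps), 1.0 + eps)


def weighted_nll(
    logps_policy: Sequence[float],
    logps_ref: Sequence[float],
    eps: float = 0.2,
) -> float:
    """Mean token-level NLL, each token scaled by its clipped inverse ratio.

    In a real training loop the weight would be treated as a constant
    (no gradient flows through it); here everything is plain floats.
    """
    if len(logps_policy) != len(logps_ref):
        raise ValueError("log-prob sequences must have the same length")
    if not logps_policy:
        return 0.0
    total = 0.0
    for lp, lr in zip(logps_policy, logps_ref):
        total += -clipped_inverse_ratio(lp, lr, eps) * lp
    return total / len(logps_policy)
```

Under this reading, tokens the policy already rates far above the reference are down-weighted to `1 - eps`, and tokens it rates far below are up-weighted to at most `1 + eps`, so no single token can pull the policy arbitrarily far from the reference.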

Topics

Supervised Fine-Tuning
Large Language Models
Reasoning Tasks
On-Policy Rollouts
Model Calibration
Code Reasoning

Code references

zjd1sq/RASFT

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.