RASFT: Rollout-Adaptive Supervised Fine-Tuning for Reasoning

· Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, medium

Summary

Rollout-Adaptive Supervised Fine-Tuning (RASFT) is a policy-aware framework for improving the reasoning ability of large language models by addressing a limitation of standard Supervised Fine-Tuning (SFT): SFT often overfits to a single expert demonstration per problem and suppresses the model's own reasoning distribution. RASFT adjusts the strength of expert supervision according to problem-level solvability, which it estimates from verified on-policy rollouts. On problems the model rarely solves, RASFT intensifies expert guidance. On problems the model already solves reliably, it relaxes rigid imitation and trains on the model's own correct, self-generated trajectories as well. The framework also applies a clipped inverse ratio to prevent excessive policy drift. In experiments on six mathematical and two code reasoning benchmarks, RASFT is reported to consistently outperform SFT, SFT variants, and reinforcement learning (RL) methods. It achieved a 10.9% relative gain on math reasoning with Qwen2.5-Math-1.5B (25.00 to 27.72) and a relative gain of up to 26.9% on code generation with Llama-3.2-3B (24.93 to 31.63).
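The gains quoted above are relative, not absolute, improvements. As a quick check, the short sketch below recomputes them from the before and after scores given in the summary. The helper name `relative_gain` is illustrative and does not come from the paper.

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Relative improvement over the baseline, in percent."""
    return (improved - baseline) / baseline * 100.0


# Scores as quoted in the summary above.
math_gain = relative_gain(25.00, 27.72)   # Qwen2.5-Math-1.5B, math reasoning
code_gain = relative_gain(24.93, 31.63)   # Llama-3.2-3B, code generation

print(f"math: {math_gain:.1f}% relative")  # prints "math: 10.9% relative"
print(f"code: {code_gain:.1f}% relative")  # prints "code: 26.9% relative"
```

In absolute terms these correspond to gains of 2.72 and 6.70 points respectively.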

Key takeaway

For machine learning engineers fine-tuning LLMs on complex reasoning tasks: consider a rollout-adaptive approach such as RASFT. Standard SFT may be limiting your model's performance by rigidly imitating a single expert path per problem. By adjusting supervision according to the model's on-policy performance and mixing in its own verified correct solutions, you can draw on the model's existing reasoning ability rather than overwrite it. RASFT's reported gains on math and code benchmarks support this.

Key insights

RASFT calibrates expert SFT guidance against on-policy rollouts to prevent overfitting and activate the LLM's intrinsic reasoning.

Method

RASFT samples multiple on-policy rollouts for each problem, verifies them, and combines the correct ones with the expert trajectory. It estimates problem solvability from the rollout success rate, then uses that estimate to weight expert guidance against self-generated paths, with updates constrained by a clipped inverse policy ratio.
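The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the summary does not give the weighting formula or the exact form of the clipped inverse ratio. The linear weighting (`1 - s` for the expert, `s` shared among correct rollouts), the clip range, and all function names here are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class WeightedTrajectory:
    text: str
    weight: float
    source: str  # "expert" or "self"


def estimate_solvability(rollouts: Sequence[str],
                         verify: Callable[[str], bool]) -> float:
    """Fraction of on-policy rollouts that pass verification (0.0 if none)."""
    if not rollouts:
        return 0.0
    return sum(1 for r in rollouts if verify(r)) / len(rollouts)


def build_weighted_targets(expert: str,
                           rollouts: Sequence[str],
                           verify: Callable[[str], bool]) -> List[WeightedTrajectory]:
    """Mix the expert trajectory with verified self-generated ones."""
    s = estimate_solvability(rollouts, verify)
    correct = [r for r in rollouts if verify(r)]
    # ASSUMPTION: linear interpolation. Hard problems (low s) put most of the
    # weight on the expert; easy problems shift weight to the model's own paths.
    targets = [WeightedTrajectory(expert, 1.0 - s, "expert")]
    for r in correct:
        targets.append(WeightedTrajectory(r, s / len(correct), "self"))
    return targets


def clipped_inverse_ratio(logp_current: float,
                          logp_reference: float,
                          clip: float = 2.0) -> float:
    """ASSUMPTION: inverse ratio = p_reference / p_current, clipped to
    [1/clip, clip]. Clamping in log space avoids overflow in exp()."""
    bound = math.log(clip)
    log_inv = logp_reference - logp_current
    return math.exp(min(max(log_inv, -bound), bound))


# Usage: four rollouts, two of which the verifier accepts.
rollouts = ["42", "41", "42", "40"]
targets = build_weighted_targets("expert solution", rollouts, lambda r: r == "42")
for t in targets:
    print(t.source, t.weight)
```

In a real training loop each weight would scale the per-trajectory loss, and the clipped ratio would be computed from token-level log-probabilities under the current and reference policies. Those details are left out because the summary does not specify them.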

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.