RASFT: Rollout-Adaptive Supervised Fine-Tuning for Reasoning

2026-06-05 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

Rollout-Adaptive Supervised Fine-Tuning (RASFT) is a policy-aware framework for improving large language models on reasoning tasks. It targets a limitation of traditional Supervised Fine-Tuning (SFT), which can overfit to single expert trajectories. RASFT calibrates expert supervision by estimating problem-level solvability from verified on-policy rollouts. When the current model struggles on a problem, it intensifies expert guidance; when the model reasons reliably, it relaxes rigid imitation and integrates correct self-generated trajectories. To prevent excessive policy drift and preserve useful reasoning priors, RASFT incorporates a clipped inverse ratio between a frozen reference model and the current policy. In evaluations across multiple models on six mathematical reasoning benchmarks and two code reasoning benchmarks, RASFT achieves better overall performance than SFT, its variants, and representative reinforcement learning methods. The project's code is publicly available on GitHub.

Key takeaway

Machine learning engineers fine-tuning large language models for complex reasoning can consider RASFT as an alternative to rigid SFT. The method adjusts supervision per problem, which limits overfitting to a single expert path and makes use of the model's own verified-correct reasoning. The two components to explore are on-policy rollouts for estimating solvability and a clipped inverse ratio against a frozen reference model. The paper reports better overall results than SFT variants and representative RL methods on mathematical and code reasoning benchmarks.

Key insights

RASFT adjusts expert supervision for reasoning tasks according to how reliably the current policy solves each problem in verified on-policy rollouts, which reduces overfitting to single expert trajectories.

Principles

Reasoning requires adaptive, not rigid, imitation.
Calibrate supervision based on the policy's current solvability.
Incorporate self-generated correct trajectories and constrain policy drift.

Method

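RASFT estimates per-problem solvability from verified on-policy rollouts. When the policy struggles, it strengthens expert guidance; when the policy solves a problem reliably, it relaxes strict imitation and adds the policy's own correct trajectories to the training signal. A clipped inverse ratio between a frozen reference model and the current policy limits policy drift.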
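The calibration step described in the method can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `estimate_solvability`, `calibrate_supervision`, and `Supervision` are hypothetical, and the linear expert weight (`1 - solvability`) and the `threshold` of 0.5 are assumptions chosen only to make the idea concrete. The summary does not specify the actual weighting rule.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Supervision:
    solvability: float        # fraction of rollouts that passed verification
    expert_weight: float      # how strongly to imitate the expert trajectory
    self_targets: List[str]   # verified self-generated trajectories to train on


def estimate_solvability(rollouts: Sequence[str],
                         verify: Callable[[str], bool]) -> float:
    """Fraction of on-policy rollouts that a verifier accepts."""
    if not rollouts:
        return 0.0
    return sum(1 for r in rollouts if verify(r)) / len(rollouts)


def calibrate_supervision(rollouts: Sequence[str],
                          verify: Callable[[str], bool],
                          threshold: float = 0.5) -> Supervision:
    """Map a solvability estimate to a supervision mix for one problem.

    Assumptions (not taken from the paper): the expert weight decays
    linearly with solvability, and self-generated trajectories are only
    added once solvability reaches `threshold`.
    """
    p = estimate_solvability(rollouts, verify)
    expert_weight = 1.0 - p
    self_targets = [r for r in rollouts if verify(r)] if p >= threshold else []
    return Supervision(p, expert_weight, self_targets)


if __name__ == "__main__":
    # Toy verifier: a rollout is correct if it equals the reference answer.
    is_correct = lambda r: r == "42"
    print(calibrate_supervision(["42", "41", "42", "42"], is_correct))
    print(calibrate_supervision(["1", "2"], is_correct))
```

In this sketch, a problem the policy mostly solves gets a low expert weight plus its own verified answers as targets, while a problem it never solves falls back to full expert imitation.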

In practice

Apply adaptive supervision to LLM reasoning tasks.
Integrate the model's own verified-correct rollouts into training.
Use a reference model to stabilize fine-tuning.
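One way to picture the reference-model stabilizer is as a per-token weight on the imitation loss. The sketch below is a hedged illustration, not the paper's formula: it assumes the inverse ratio is `pi_ref / pi_theta` computed per token from log-probabilities, that the weight is treated as a constant (no gradient flows through it), and that the clip bounds `low` and `high` are free hyperparameters. The function names are hypothetical.

```python
import math
from typing import Sequence


def clipped_inverse_ratio(logp_policy: float, logp_ref: float,
                          low: float = 0.5, high: float = 2.0) -> float:
    """Clip pi_ref / pi_theta for one token into [low, high].

    Assumption: "inverse ratio" means reference probability over
    current-policy probability. The bounds are illustrative.
    """
    ratio = math.exp(logp_ref - logp_policy)
    return min(max(ratio, low), high)


def weighted_nll(logps_policy: Sequence[float],
                 logps_ref: Sequence[float],
                 low: float = 0.5, high: float = 2.0) -> float:
    """Mean negative log-likelihood with clipped-ratio token weights."""
    if len(logps_policy) != len(logps_ref):
        raise ValueError("log-probability sequences must have equal length")
    if not logps_policy:
        return 0.0
    total = 0.0
    for lp, lr in zip(logps_policy, logps_ref):
        weight = clipped_inverse_ratio(lp, lr, low, high)
        total += -weight * lp
    return total / len(logps_policy)


if __name__ == "__main__":
    # Token the reference finds likely but the policy finds unlikely:
    # the raw ratio is 90, so the weight is clipped to the upper bound.
    print(clipped_inverse_ratio(math.log(0.01), math.log(0.9)))
    # Policy and reference agree: the weight is about 1.
    print(clipped_inverse_ratio(math.log(0.5), math.log(0.5)))
```

The clipping is what keeps any single token from dominating the update: without it, a token the policy assigns very low probability would receive an unbounded weight.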

Topics

Large Language Models
Supervised Fine-Tuning
Reasoning Tasks
Policy-Aware Learning
Mathematical Reasoning
Code Reasoning

Code references

zjd1sq/RASFT

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.