RASFT: Rollout-Adaptive Supervised Fine-Tuning for Reasoning
Summary
Rollout-Adaptive Supervised Fine-Tuning (RASFT) is a policy-aware framework for improving large language models on reasoning tasks. It targets a limitation of traditional Supervised Fine-Tuning (SFT), which can overfit to single expert trajectories. RASFT calibrates expert supervision by estimating problem-level solvability from verified on-policy rollouts. When the current model struggles with a problem, it intensifies expert guidance; when the model reasons reliably, it relaxes rigid imitation and integrates correct self-generated trajectories. To prevent excessive policy drift and preserve useful reasoning priors, RASFT incorporates a clipped inverse ratio between a frozen reference model and the current policy. Experiments across multiple models on six mathematical reasoning benchmarks and two code reasoning benchmarks show that RASFT consistently achieves better overall performance than SFT, its variants, and representative reinforcement learning methods. The project's code is publicly available on GitHub.
Key takeaway
Machine learning engineers fine-tuning large language models for complex reasoning should consider RASFT as an alternative to rigid SFT. The method adjusts supervision per problem, which limits overfitting to single expert paths and makes use of the model's own reliable reasoning. Integrating on-policy rollouts and a clipped inverse ratio may improve results on mathematical and code reasoning benchmarks relative to traditional SFT or RL approaches.
Key insights
RASFT adjusts expert supervision on reasoning tasks according to how reliably the current policy solves each problem, as measured by verified on-policy rollouts, which reduces overfitting to single expert trajectories.
Principles
- Reasoning requires adaptive, not rigid, imitation.
- Calibrate supervision to how reliably the current policy solves each problem.
- Incorporate self-generated correct trajectories and constrain policy drift.
Method
RASFT estimates each problem's solvability from verified on-policy rollouts. Where the policy struggles, it strengthens expert guidance; where the policy is capable, it integrates the policy's own correct trajectories. A clipped inverse ratio between a frozen reference model and the current policy limits policy drift.
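The adaptive step described here can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the `Rollout` type, the `threshold` parameter, and the specific weighting rule (expert weight `1 - p`, with the remaining mass split evenly across correct rollouts) are all assumptions made for the example, since the summary does not state the exact formula.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Rollout:
    text: str
    correct: bool  # outcome of an external verifier (answer match, unit tests, ...)


def estimate_solvability(rollouts: List[Rollout]) -> float:
    """Fraction of verified-correct on-policy rollouts for one problem."""
    if not rollouts:
        return 0.0
    return sum(r.correct for r in rollouts) / len(rollouts)


def build_targets(
    expert: str, rollouts: List[Rollout], threshold: float = 0.5
) -> Tuple[float, List[Tuple[str, float]]]:
    """Return (solvability, [(target_text, weight), ...]) for one problem.

    Assumed rule (hypothetical, for illustration only):
      - the expert trajectory gets weight 1 - p, so it dominates on hard problems;
      - once p reaches `threshold`, correct self-generated rollouts share weight p.
    """
    p = estimate_solvability(rollouts)
    targets = [(expert, 1.0 - p)]
    own = [r.text for r in rollouts if r.correct]
    if own and p >= threshold:
        share = p / len(own)
        targets += [(text, share) for text in own]
    return p, targets


# A problem the policy mostly solves: 3 of 4 rollouts verified correct.
easy = [Rollout("a", True), Rollout("b", True), Rollout("c", False), Rollout("d", True)]
# A problem the policy never solves.
hard = [Rollout("x", False), Rollout("y", False)]

p_easy, targets_easy = build_targets("expert solution", easy)
p_hard, targets_hard = build_targets("expert solution", hard)
```

Under these assumptions, the hard problem is trained purely on the expert trajectory, while the easy one spreads most of its weight over the model's own verified solutions.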
In practice
- Apply adaptive supervision to LLM reasoning tasks.
- Integrate the model's own correct rollouts into training.
- Use a reference model to stabilize fine-tuning.
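One way to picture the reference-model stabilizer is as a per-token weight on the usual negative log-likelihood. The sketch below assumes the "inverse ratio" means reference probability divided by policy probability, clipped to `[1 - eps, 1 + eps]`; the summary does not give the exact form, so the direction of the ratio, the clipping range, the `eps` value, and the function names are all illustrative assumptions.

```python
import math
from typing import List


def clipped_inverse_ratio(logp_ref: float, logp_policy: float, eps: float = 0.2) -> float:
    """Assumed form: clip(pi_ref / pi_policy, 1 - eps, 1 + eps), from log-probabilities."""
    ratio = math.exp(logp_ref - logp_policy)
    return min(max(ratio, 1.0 - eps), 1.0 + eps)


def weighted_nll(logps_policy: List[float], logps_ref: List[float], eps: float = 0.2) -> float:
    """Mean token negative log-likelihood, each token scaled by its clipped weight.

    In an autodiff framework the weight would normally be treated as a constant
    (no gradient through it); this plain-Python version only computes the value.
    """
    if not logps_policy or len(logps_policy) != len(logps_ref):
        raise ValueError("need two non-empty log-prob lists of equal length")
    total = 0.0
    for lp, lr in zip(logps_policy, logps_ref):
        total += -clipped_inverse_ratio(lr, lp, eps) * lp
    return total / len(logps_policy)


# Policy and reference agree: weights are 1 and the loss is the plain mean NLL.
loss_same = weighted_nll([-1.0, -2.0], [-1.0, -2.0])

# The policy has become far more confident than the reference on this token,
# so the weight bottoms out at the lower clip bound.
w_drifted = clipped_inverse_ratio(logp_ref=-5.0, logp_policy=0.0)
```

With this form, tokens where the policy has moved well above the reference receive a smaller weight, which damps further movement in the same direction; the clip keeps any single token from dominating the update.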
Topics
- Large Language Models
- Supervised Fine-Tuning
- Reasoning Tasks
- Policy-Aware Learning
- Mathematical Reasoning
- Code Reasoning
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.