RASFT: Rollout-Adaptive Supervised Fine-Tuning for Reasoning

· Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

Rollout-Adaptive Supervised Fine-Tuning (RASFT) is a policy-aware framework for improving large language models on reasoning tasks. It targets a limitation of traditional Supervised Fine-Tuning (SFT), which can overfit to single expert trajectories. RASFT calibrates expert supervision by estimating problem-level solvability from verified on-policy rollouts: it intensifies expert guidance when the current model struggles, and it relaxes rigid imitation and integrates correct self-generated trajectories when the model reasons reliably. To prevent excessive policy drift and preserve useful reasoning priors, RASFT incorporates a clipped inverse ratio between a frozen reference model and the current policy. In evaluations across multiple models on six mathematical reasoning benchmarks and two code reasoning benchmarks, RASFT consistently achieves better overall performance than SFT, its variants, and representative reinforcement learning methods. The project's code is publicly available on GitHub.
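
The solvability-based calibration described above can be illustrated with a minimal sketch. It is not the authors' implementation: the summary does not give the actual weighting rule, so the linear schedule, the `min_expert_weight` floor, and the names `SupervisionPlan` and `plan_supervision` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SupervisionPlan:
    solvability: float            # fraction of rollouts the verifier accepted
    expert_weight: float          # loss weight for the expert trajectory
    self_trajectories: List[str]  # verified rollouts kept as extra targets


def plan_supervision(
    rollouts: List[str],
    verify: Callable[[str], bool],
    min_expert_weight: float = 0.2,  # hypothetical floor, not from the source
) -> SupervisionPlan:
    """Decide how strongly to imitate the expert for one problem."""
    if not rollouts:
        raise ValueError("need at least one on-policy rollout")
    # Keep only rollouts that pass the verifier (e.g. an answer checker).
    correct = [r for r in rollouts if verify(r)]
    # Problem-level solvability: empirical pass rate of the current policy.
    solvability = len(correct) / len(rollouts)
    # Assumed linear schedule: low solvability gives strong expert guidance,
    # high solvability relaxes imitation down to the floor.
    expert_weight = max(min_expert_weight, 1.0 - solvability)
    return SupervisionPlan(solvability, expert_weight, correct)
```

For example, with four rollouts of which two pass the verifier, this sketch yields a solvability of 0.5, an expert weight of 0.5, and the two verified rollouts as additional training targets.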

Key takeaway

Machine learning engineers fine-tuning large language models for complex reasoning can consider RASFT as an alternative to rigid SFT. The method adjusts supervision dynamically, which guards against overfitting to single expert paths and makes use of the model's own reliable reasoning. The two components to explore are on-policy rollouts and a clipped inverse ratio, a combination that may improve results on mathematical and code reasoning benchmarks relative to traditional SFT or RL approaches.

Key insights

RASFT adjusts expert supervision for reasoning tasks according to how well the current policy solves each problem in verified on-policy rollouts, which reduces overfitting to single expert trajectories.

Principles

Method

RASFT estimates problem solvability via on-policy rollouts, strengthening expert guidance for struggling policies and integrating self-generated correct trajectories for capable ones, while using a clipped inverse ratio to limit policy drift.
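
The drift-limiting term can be sketched as a per-token weight on the negative log-likelihood. This is a hedged illustration only: the summary does not state the ratio's direction or the clipping bounds, so the sketch assumes the ratio is reference probability over policy probability and uses hypothetical bounds `low` and `high`. The function names are also illustrative.

```python
import math
from typing import List


def clipped_inverse_ratio(
    ref_logprob: float,
    policy_logprob: float,
    low: float = 0.5,   # hypothetical lower clip bound
    high: float = 2.0,  # hypothetical upper clip bound
) -> float:
    """Return clip(pi_ref / pi_theta, low, high) for one token."""
    # Assumed direction: reference probability over current-policy probability.
    ratio = math.exp(ref_logprob - policy_logprob)
    return min(max(ratio, low), high)


def weighted_nll(
    policy_logprobs: List[float],
    ref_logprobs: List[float],
    low: float = 0.5,
    high: float = 2.0,
) -> float:
    """Mean token-level negative log-likelihood, scaled by the clipped ratio."""
    if not policy_logprobs or len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob lists must be non-empty and equal length")
    total = 0.0
    for lp, lr in zip(policy_logprobs, ref_logprobs):
        # In an autodiff framework the weight would typically be treated as a
        # constant (no gradient through it); that is also an assumption here.
        weight = clipped_inverse_ratio(lr, lp, low, high)
        total += weight * (-lp)
    return total / len(policy_logprobs)
```

Under this assumed direction, tokens where the policy already assigns much more probability than the reference receive a smaller weight, which damps further movement away from the frozen reference model.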

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.