Towards On-Policy SFT: Distribution Discriminant Theory and its Applications in LLM Training
Summary
A new framework aims to bridge the generalization gap between computationally efficient Supervised Fine-Tuning (SFT) and reinforcement learning (RL) by enabling On-Policy SFT. The core of this framework is the novel Distribution Discriminant Theory (DDT), which provides a quantitative explanation for the alignment between training data and the model's generated distribution. Building on DDT, the authors introduce two techniques: In-Distribution Finetuning (IDFT), a loss-level method designed to improve SFT's generalization, and Hinted Decoding, a data-level technique that re-aligns the training corpus with the model's distribution. Experiments show this framework achieves generalization performance comparable to leading offline RL algorithms like DPO and SimPO, while retaining SFT's efficiency, offering a viable alternative where RL is impractical. The code is open-sourced.
Key takeaway
For AI Engineers and Research Scientists who develop large language models under tight compute budgets, this On-Policy SFT framework is an alternative to RL worth evaluating. The authors report generalization on par with DPO and SimPO at the cost of an SFT pipeline, which suits resource-limited environments. The open-sourced code is the starting point for integrating IDFT and Hinted Decoding into a training workflow.
Key insights
On-Policy SFT, enabled by Distribution Discriminant Theory, reaches generalization comparable to offline RL methods such as DPO and SimPO while keeping SFT's efficiency.
Principles
- On-policy data drives RL's superior generalization.
- Data-model distribution alignment is quantifiable.
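To make "quantifiable alignment" concrete, here is a minimal sketch of one common proxy: the average log-probability (equivalently, the perplexity) that the model assigns to a training response. This is an illustrative assumption, not the discriminant defined by DDT, which this summary does not specify. The function names and the per-token log-probabilities below are hypothetical.

```python
import math
from typing import Sequence


def mean_token_logprob(token_logprobs: Sequence[float]) -> float:
    """Average log-probability the model assigns to the tokens of one response."""
    if not token_logprobs:
        raise ValueError("response must contain at least one token")
    return sum(token_logprobs) / len(token_logprobs)


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity of a response under the model; lower means closer to the model's own distribution."""
    return math.exp(-mean_token_logprob(token_logprobs))


# Hypothetical per-token log-probs, as a model would report them for two responses.
model_sample = [-0.2, -0.1, -0.3, -0.2]    # text the model generated itself
dataset_sample = [-1.9, -2.4, -0.8, -3.1]  # text written by an external source

# Under this proxy, the self-generated response scores as better aligned.
closer_to_policy = perplexity(model_sample) < perplexity(dataset_sample)
```

A response the model sampled itself typically receives a high average log-probability, while externally written data often scores lower. That gap is the kind of data-model mismatch the principles above refer to.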
Method
The framework uses Distribution Discriminant Theory (DDT) to guide two techniques that together achieve on-policy SFT: In-Distribution Finetuning (IDFT), which works at the loss level, and Hinted Decoding, which re-aligns the training data with the model's distribution.
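For reference, the standard SFT objective that a loss-level method would start from is the token-level negative log-likelihood over a fixed dataset $\mathcal{D}$. The notation below ($\pi_\theta$ for the model, $x$ for the prompt, $y$ for the response) is assumed for illustration. The exact form of the IDFT loss is not given in this summary and is not reproduced here.

```latex
\mathcal{L}_{\mathrm{SFT}}(\theta)
  = -\,\mathbb{E}_{(x,\,y)\sim\mathcal{D}}
    \left[ \sum_{t=1}^{|y|} \log \pi_\theta\!\left(y_t \mid x,\, y_{<t}\right) \right]
```

In this objective, $y$ comes from an external corpus. Training is on-policy when responses are instead drawn from the model itself, $y \sim \pi_\theta(\cdot \mid x)$, which is the regime the two techniques aim to approximate.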
In practice
- Apply IDFT to enhance SFT generalization.
- Use Hinted Decoding to re-align training data.
Topics
- Supervised Fine-tuning
- Reinforcement Learning
- Large Language Models
- Distribution Discriminant Theory
- On-Policy Learning
Best for: AI Engineer, NLP Engineer, Research Scientist, AI Researcher, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.