Towards On-Policy SFT: Distribution Discriminant Theory and its Applications in LLM Training

2026-02-12 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Advanced, quick

Summary

A new framework aims to bridge the generalization gap between computationally efficient Supervised Fine-Tuning (SFT) and reinforcement learning (RL) by enabling On-Policy SFT. The core of this framework is the novel Distribution Discriminant Theory (DDT), which provides a quantitative explanation for the alignment between training data and the model's generated distribution. Building on DDT, the authors introduce two techniques: In-Distribution Finetuning (IDFT), a loss-level method designed to improve SFT's generalization, and Hinted Decoding, a data-level technique that re-aligns the training corpus with the model's distribution. Experiments show this framework achieves generalization performance comparable to leading offline RL algorithms like DPO and SimPO, while retaining SFT's efficiency, offering a viable alternative where RL is impractical. The code is open-sourced.

Key takeaway

If you train large language models and the computational cost of RL is a constraint, this On-Policy SFT framework is worth evaluating. The authors report generalization performance comparable to DPO and SimPO at the cost of an SFT pipeline, which makes it a candidate for resource-limited environments. The open-sourced code is the starting point for integrating IDFT and Hinted Decoding into your training workflows.

Key insights

On-Policy SFT, enabled by Distribution Discriminant Theory, reaches generalization comparable to offline RL methods such as DPO and SimPO while keeping SFT's efficiency.

Principles

On-policy data drives RL's superior generalization.
Data-model distribution alignment is quantifiable.
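
To make the second principle concrete, here is a minimal sketch of one generic way to put a number on data-model alignment: the average log-probability a model assigns to the tokens of a training response. This is an illustrative proxy only, not the discriminant defined by DDT, whose formula this summary does not give. The function names and the toy logits are assumptions made for the example.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def mean_token_logprob(logits_per_step: Sequence[Sequence[float]],
                       target_ids: Sequence[int]) -> float:
    """Average log-probability the model assigns to the reference tokens.

    Hypothetical helper: a generic alignment proxy, NOT the DDT discriminant.
    Its negation is the usual token-level SFT cross-entropy loss.
    """
    if len(logits_per_step) != len(target_ids) or not target_ids:
        raise ValueError("need exactly one logit vector per target token")
    total = 0.0
    for logits, tok in zip(logits_per_step, target_ids):
        total += log_softmax(logits)[tok]
    return total / len(target_ids)


def perplexity(mean_logprob: float) -> float:
    """Perplexity corresponding to a mean token log-probability."""
    return math.exp(-mean_logprob)


# Toy model outputs over a 3-token vocabulary (made-up numbers).
toy_logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]]

# A response made of tokens the model itself favours ("on-policy-like") ...
on_policy_like = mean_token_logprob(toy_logits, [0, 1])
# ... versus one made of tokens the model finds unlikely ("off-policy-like").
off_policy_like = mean_token_logprob(toy_logits, [2, 2])
```

Under this proxy, a higher mean log-probability (lower perplexity) means the response looks more like something the model would generate itself. A real pipeline would take the logits from a forward pass of the model being fine-tuned rather than from hand-written lists.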

Method

The framework uses Distribution Discriminant Theory (DDT) to inform In-Distribution Finetuning (IDFT) for loss-level enhancement and Hinted Decoding for data-level re-alignment to achieve on-policy SFT.

In practice

Apply IDFT to enhance SFT generalization.
Use Hinted Decoding to re-align training data.

Topics

Supervised Fine-tuning
Reinforcement Learning
Large Language Models
Distribution Discriminant Theory
On-Policy Learning

Code references

zhangmiaosen2000/Towards-On-Policy-SFT

Best for: AI Engineer, NLP Engineer, Research Scientist, AI Researcher, AI Scientist, Machine Learning Engineer


Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.