GFT: From Imitation to Reward Fine-Tuning with Unbiased Group Advantages and Dynamic Coefficient Rectification

2026-04-15 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Research Methodology & Innovation · Depth: Expert, quick

Summary

Group Fine-Tuning (GFT) is a post-training framework that targets known weaknesses of current large language model (LLM) fine-tuning methods. The authors recast supervised fine-tuning (SFT) as policy gradient optimization with a sparse implicit reward and an unstable inverse-probability weight, and attribute three problems to that structure: single-path dependency, entropy collapse, and gradient explosion. GFT addresses them with two mechanisms. Group Advantage Learning generates a group of diverse responses and applies normalized contrastive supervision across the group to reduce reward sparsity. Dynamic Coefficient Rectification adaptively bounds the inverse-probability weights to stabilize optimization. In the reported experiments, GFT consistently outperforms SFT-based approaches and produces policies that integrate more smoothly with subsequent reinforcement learning (RL) training.

Key takeaway

For research scientists developing or fine-tuning large language models, GFT offers a more stable and effective post-training alternative to traditional SFT. It directly targets two common SFT pitfalls, gradient instability and entropy collapse. Consider GFT when a model must generalize beyond its reference responses or when the fine-tuned policy will be handed to a subsequent RL stage.

Key insights

GFT bridges imitation-style SFT and reward-based fine-tuning: it supervises with group advantages instead of a single reference path and bounds the inverse-probability weights that destabilize SFT.

Principles

SFT can be viewed as policy gradient optimization.
Reward sparsity leads to single-path dependency.
Unstable weighting causes gradient explosion.
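
The first principle can be made concrete with a standard identity. The derivation below is a sketch in notation chosen here for illustration (policy \pi_\theta, prompt x, single reference response y^*); the paper's own notation and exact formulation may differ.

```latex
% Gradient of the SFT loss on one reference response y^* for prompt x
\nabla_\theta \mathcal{L}_{\mathrm{SFT}}(\theta)
  = -\nabla_\theta \log \pi_\theta(y^* \mid x)
  = -\frac{1}{\pi_\theta(y^* \mid x)} \, \nabla_\theta \pi_\theta(y^* \mid x)

% The same quantity written as an on-policy expectation
-\nabla_\theta \log \pi_\theta(y^* \mid x)
  = -\mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}
    \left[ \frac{\mathbb{1}[y = y^*]}{\pi_\theta(y \mid x)} \,
           \nabla_\theta \log \pi_\theta(y \mid x) \right]
```

Read this way, the indicator \mathbb{1}[y = y^*] acts as a reward that is nonzero on exactly one response, which is the sparsity behind single-path dependency. The factor 1 / \pi_\theta(y \mid x) is the inverse-probability weight, and it grows without bound as the policy assigns less probability to the reference.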

Method

GFT uses Group Advantage Learning for diverse response groups and normalized contrastive supervision, alongside Dynamic Coefficient Rectification to bound inverse-probability weights for stable optimization.
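
The two mechanisms can be illustrated with a minimal sketch. This summary does not give the paper's formulas, so everything below is an assumption: `group_advantages` uses a plain mean and standard deviation normalization within a group as a stand-in for the paper's unbiased group advantage, and `rectified_weight` uses a fixed cap (`max_weight`, a hypothetical parameter) as a stand-in for the adaptive bound. Both function names are illustrative, not taken from the paper.

```python
from statistics import fmean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one response group.

    Assumption: mean/std normalization; the paper's exact
    'unbiased' estimator is not given in this summary.
    """
    mean = fmean(rewards)
    std = pstdev(rewards)
    # eps keeps the division defined when every reward is identical.
    return [(r - mean) / (std + eps) for r in rewards]


def rectified_weight(prob, max_weight=10.0):
    """Cap the inverse-probability weight 1/prob at max_weight.

    Assumption: a fixed cap; the paper describes an adaptive bound
    whose rule is not given in this summary.
    """
    if not 0.0 < prob <= 1.0:
        raise ValueError("prob must be in (0, 1]")
    return min(1.0 / prob, max_weight)


if __name__ == "__main__":
    # Four sampled responses to one prompt, scored 1.0 (good) or 0.0 (bad).
    print(group_advantages([1.0, 0.0, 0.0, 1.0]))
    # A likely response keeps its weight; an unlikely one is capped.
    print(rectified_weight(0.5), rectified_weight(1e-4))
```

Under these assumptions, every response in the group receives a signed learning signal relative to its peers rather than only the single reference path, and no single low-probability response can dominate an update through an unbounded weight.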

In practice

Integrates SFT and RL training more smoothly.
Alleviates reward sparsity in LLM fine-tuning.

Topics

Group Fine-Tuning
LLM Post-training
Supervised Fine-Tuning
Reinforcement Learning
Group Advantage Learning

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.