Stabilizing On-Policy Distillation for MLLM Reasoning with Global Normalization

2026-06-08 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

Globally Normalized Distillation Policy Optimization (GNDPO) is a method for stabilizing on-policy distillation (OPD) in multimodal large language model (MLLM) reasoning. OPD is a post-training paradigm in which a stronger teacher model provides dense, fine-grained supervision, in contrast to reinforcement learning with verifiable rewards (RLVR), which relies on sparse feedback. However, standard token-level distillation in OPD can suffer from gradient instability, because score magnitudes become misaligned in outlier states. GNDPO addresses this by transforming raw KL scores into batch-level relative advantages, which mitigates gradient explosions while preserving the benefits of token-level guidance. The authors report that GNDPO improves training robustness and downstream performance across multiple multimodal reasoning tasks. The code is publicly available.

Key takeaway

For machine learning engineers developing or fine-tuning multimodal large language models, on-policy distillation can be hard to run reliably because of gradient instability. If you see that instability, consider integrating Globally Normalized Distillation Policy Optimization (GNDPO) into your training pipeline. By transforming raw KL scores into batch-level relative advantages, GNDPO is designed to stabilize optimization and prevent gradient explosions, and the reported results show improved training robustness and better MLLM performance on reasoning tasks.

Key insights

GNDPO stabilizes on-policy distillation for MLLM reasoning by normalizing KL scores into batch-level relative advantages.

Principles

Dense supervision improves over sparse rewards.
Gradient stability is crucial for distillation.
Normalization mitigates outlier state issues.

Method

GNDPO transforms raw KL scores into batch-level relative advantages to stabilize optimization, mitigating gradient explosions while retaining token-level guidance benefits.
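
The summary does not give the exact normalization formula, so the following is only a minimal sketch of one plausible reading: standardize every per-token KL score against the mean and standard deviation of the whole batch. The function name batch_relative_advantages, the eps term, and the sign convention (lower-than-average KL gets a positive advantage) are assumptions for illustration, not the authors' implementation.

```python
import math

def batch_relative_advantages(kl_scores, eps=1e-6):
    """Standardize per-token KL scores across the whole batch.

    kl_scores: list of sequences, each a list of per-token KL scores.
    Returns advantages with the same nested shape.
    Assumed convention: lower-than-average KL -> positive advantage.
    """
    flat = [s for seq in kl_scores for s in seq]
    if not flat:
        return [[] for _ in kl_scores]
    mean = sum(flat) / len(flat)
    # Population standard deviation over every token in the batch.
    std = math.sqrt(sum((s - mean) ** 2 for s in flat) / len(flat))
    return [[-(s - mean) / (std + eps) for s in seq] for seq in kl_scores]

# Toy batch: two sequences, one token with an outlier KL score of 50.0.
batch = [[0.10, 0.20, 0.15], [0.12, 50.0]]
advantages = batch_relative_advantages(batch)
flat_adv = [a for seq in advantages for a in seq]
```

In this toy batch the raw outlier score is hundreds of times larger than the others, while after standardization no advantage can exceed sqrt(n - 1) in magnitude for n tokens. That bounded scale is the kind of property that would limit how much a single outlier state can dominate a gradient step.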

In practice

Apply GNDPO to MLLM reasoning tasks.
Use batch-level relative advantages for stability.
Implement token-level distillation with normalization.
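
As a rough illustration of how these steps could fit together, the sketch below computes a per-token score from teacher and student log-probabilities, normalizes it over the batch, and uses the result to weight a policy-gradient style surrogate loss. Everything here is an assumption made for illustration: the score log p_student - log p_teacher (a common single-sample reverse-KL estimate on student-sampled tokens), the function names, and the loss form are not taken from the source or from the GNDPO repository.

```python
import math

def token_scores(student_logps, teacher_logps):
    """Assumed per-token score: log p_student - log p_teacher on sampled tokens."""
    return [[s - t for s, t in zip(s_seq, t_seq)]
            for s_seq, t_seq in zip(student_logps, teacher_logps)]

def normalized_distillation_loss(student_logps, teacher_logps, eps=1e-6):
    """Surrogate loss with batch-normalized per-token weights (hypothetical form)."""
    scores = token_scores(student_logps, teacher_logps)
    flat = [x for seq in scores for x in seq]
    if not flat:
        raise ValueError("batch contains no tokens")
    mean = sum(flat) / len(flat)
    std = math.sqrt(sum((x - mean) ** 2 for x in flat) / len(flat))
    # Batch-level relative advantages; in a real autograd framework these
    # would be treated as constants (detached) when differentiating the loss.
    weights = [[-(x - mean) / (std + eps) for x in seq] for seq in scores]
    total = sum(w * lp
                for w_seq, lp_seq in zip(weights, student_logps)
                for w, lp in zip(w_seq, lp_seq))
    # Minimizing this raises log-probs of tokens with positive advantage.
    return -total / len(flat), weights

# Toy log-probabilities of the student's sampled tokens under each model.
student = [[-0.5, -1.0], [-0.2, -0.3]]
teacher = [[-0.4, -0.9], [-0.3, -6.0]]  # teacher strongly disfavors the last token
loss, weights = normalized_distillation_loss(student, teacher)
```

Here the last token, which the student favors but the teacher strongly disfavors, receives a negative weight of bounded size, so it is still penalized without its raw score magnitude setting the scale of the update.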

Topics

On-Policy Distillation
MLLM Reasoning
Gradient Stability
Policy Optimization
Multimodal AI
Knowledge Distillation

Code references

OPPO-Mente-Lab/GNDPO

Best for: Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.