Stabilizing On-Policy Distillation for MLLM Reasoning with Global Normalization
Summary
Globally Normalized Distillation Policy Optimization (GNDPO) is a method for stabilizing on-policy distillation (OPD) in multimodal large language model (MLLM) reasoning. OPD is a post-training paradigm in which a stronger teacher model provides dense, fine-grained supervision. This gives it an advantage over reinforcement learning with verifiable rewards (RLVR), which relies on sparse feedback. However, conventional token-level distillation in OPD can suffer from gradient instability caused by magnitude misalignment in outlier states. GNDPO addresses this by transforming raw KL scores into batch-level relative advantages, which mitigates gradient explosions while preserving the benefits of token-level guidance. The authors report that GNDPO significantly improves training robustness and downstream performance across a range of multimodal reasoning tasks. The code is publicly available.
Key takeaway
Machine learning engineers who develop or fine-tune multimodal large language models may find on-policy distillation hard to train because of gradient instability. If so, consider integrating Globally Normalized Distillation Policy Optimization (GNDPO) into the training pipeline. By transforming raw KL scores into batch-level relative advantages, GNDPO is reported to stabilize optimization, mitigate gradient explosions, and improve training robustness, which in turn improves MLLM performance on reasoning tasks.
Key insights
GNDPO stabilizes on-policy distillation for MLLM reasoning by normalizing KL scores into batch-level relative advantages.
Principles
- Dense supervision improves over sparse rewards.
- Gradient stability is crucial for distillation.
- Normalization mitigates outlier state issues.
Method
GNDPO transforms raw KL scores into batch-level relative advantages to stabilize optimization, mitigating gradient explosions while retaining token-level guidance benefits.
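The transformation described above can be illustrated with a minimal sketch. This summary does not give the exact formula, so the code below is only one plausible reading: per-token KL scores are standardized with the mean and standard deviation taken over every token in the batch, and the sign is flipped so that a lower KL score yields a higher advantage. The function name `batch_relative_advantages`, the `eps` parameter, and the sign convention are assumptions for illustration, not the authors' implementation.

```python
import math


def batch_relative_advantages(kl_scores, eps=1e-8):
    """Standardize per-token KL scores over the whole batch (hypothetical helper).

    kl_scores: list of sequences, each a list of per-token KL scores.
    Returns advantages with the same ragged shape.
    """
    # Pool every token in the batch, regardless of which sequence it came from.
    flat = [score for seq in kl_scores for score in seq]
    if not flat:
        return [[] for _ in kl_scores]

    mean = sum(flat) / len(flat)
    variance = sum((score - mean) ** 2 for score in flat) / len(flat)
    std = math.sqrt(variance)

    # Assumed sign convention: a token where the student is far from the
    # teacher (high KL) receives a negative advantage.
    return [[-(score - mean) / (std + eps) for score in seq] for seq in kl_scores]


# One outlier token (KL = 50.0) among otherwise small scores.
batch = [[0.1, 0.2, 0.3], [0.2, 50.0]]
advantages = batch_relative_advantages(batch)
```

Under this reading, the size of the outlier no longer sets the scale of the update. With population standardization over n tokens, no advantage can exceed sqrt(n - 1) in absolute value, so the outlier in the five-token batch above is capped at magnitude 2 whether its raw score is 50 or 5000.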
In practice
- Apply GNDPO to MLLM reasoning tasks.
- Use batch-level relative advantages for stability.
- Implement token-level distillation with normalization.
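As a rough illustration of the last two points, per-token advantages are commonly used as weights in a policy-gradient style surrogate loss. The sketch below shows that generic pattern in plain Python. This summary does not state whether GNDPO uses this exact objective, so the function name `advantage_weighted_loss` and the averaging over all tokens are assumptions.

```python
def advantage_weighted_loss(token_logps, advantages):
    """Mean of -advantage * log-prob over all tokens (hypothetical surrogate).

    token_logps: list of sequences of student log-probabilities for sampled tokens.
    advantages:  list of sequences of per-token advantages, same ragged shape.
    """
    total, count = 0.0, 0
    for logps, advs in zip(token_logps, advantages):
        if len(logps) != len(advs):
            raise ValueError("each sequence needs one advantage per token")
        for logp, adv in zip(logps, advs):
            # Minimizing this term raises the log-prob of tokens with
            # positive advantage and lowers it for negative advantage.
            total += -adv * logp
            count += 1
    return total / count if count else 0.0


loss = advantage_weighted_loss([[-1.0, -2.0], [-0.5]], [[1.0, -1.0], [0.0]])
```

In a real training loop the log-probabilities would be tensors from an autograd framework and the advantages would be treated as constants, so that gradients flow only through the student's log-probabilities.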
Topics
- On-Policy Distillation
- MLLM Reasoning
- Gradient Stability
- Policy Optimization
- Multimodal AI
- Knowledge Distillation
Best for: Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.