BalCapRL: A Balanced Framework for RL-Based MLLM Image Captioning
Summary
BalCapRL introduces a balanced reinforcement learning (RL) framework for multimodal large language model (MLLM) image captioning, addressing the limitations of existing methods that often prioritize narrow caption quality metrics. Current utility-oriented objectives can lead to noisy or overlong captions, while arena-style objectives may produce fluent but generic descriptions. The proposed BalCapRL framework jointly optimizes utility-aware correctness, reference coverage, and linguistic quality. It employs GDPO-style reward-decoupled normalization for continuous-valued captioning rewards, which outperforms vanilla GRPO, and integrates length-conditional reward masking for a more appropriate length penalty. This method consistently improves caption quality across LLaVA-1.5-7B, Qwen2.5-VL 3B, and Qwen2.5-VL 7B base models, achieving peak gains of +13.6 DCScore, +9.0 CaptionQA, and +29.0 CapArena.
Key takeaway
For research scientists developing MLLM image captioning systems, BalCapRL shows that a balanced, multi-objective RL framework can improve captions on utility, coverage, and fluency metrics at once, with reported peak gains of +13.6 DCScore, +9.0 CaptionQA, and +29.0 CapArena across three base models. Consider adopting GDPO-style reward-decoupled normalization and length-conditional reward masking rather than optimizing a single metric.
Key insights
BalCapRL balances image captioning quality by jointly optimizing correctness, coverage, and linguistic fluency via a novel RL framework.
Principles
- Multi-objective optimization improves caption quality.
- Reward-decoupled normalization enhances RL performance.
- Length-conditional masking refines caption length penalties.
Method
BalCapRL combines three continuous-valued rewards: utility-aware correctness, reference coverage, and linguistic quality. It applies GDPO-style reward-decoupled normalization to these rewards instead of vanilla GRPO normalization, and it introduces length-conditional reward masking to obtain a more appropriate length penalty.
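The difference between the two normalization schemes can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the equal default weights, the epsilon constant, and the toy scores are all hypothetical, and any further batch-level rescaling is omitted.

```python
from statistics import mean, pstdev

EPS = 1e-6  # assumed small constant to avoid division by zero


def zscore(values):
    """Standardize a list of floats to zero mean and (roughly) unit variance."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + EPS) for v in values]


def grpo_advantages(rewards):
    """Vanilla GRPO-style: sum the components, then normalize once per group.

    rewards maps a reward name to the per-caption scores sampled for one image.
    """
    totals = [sum(component) for component in zip(*rewards.values())]
    return zscore(totals)


def decoupled_advantages(rewards, weights=None):
    """Reward-decoupled: normalize each component within the group, then combine."""
    weights = weights or {name: 1.0 for name in rewards}  # hypothetical equal weights
    normalized = {name: zscore(scores) for name, scores in rewards.items()}
    group_size = len(next(iter(rewards.values())))
    return [
        sum(weights[name] * normalized[name][i] for name in rewards)
        for i in range(group_size)
    ]


# Toy group of four sampled captions for one image (made-up scores).
group = {
    "correctness": [0.90, 0.92, 0.91, 0.93],
    "coverage": [0.20, 0.80, 0.50, 0.60],
    "linguistic_quality": [0.70, 0.71, 0.69, 0.70],
}
grpo = grpo_advantages(group)
decoupled = decoupled_advantages(group)
```

In this toy group, coverage has by far the widest spread, so it dominates the summed-then-normalized signal. Normalizing each component first puts the three objectives on a comparable scale before they are combined.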
In practice
- Apply GDPO-style normalization for continuous rewards.
- Integrate length-conditional reward masking.
- Evaluate captions across multiple quality dimensions.
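The summary does not spell out the rule behind length-conditional reward masking, so the following is only one hypothetical reading: selected reward components are zeroed for captions that exceed a token budget, so an overlong caption cannot earn extra credit from them. The function name, the `max_tokens` budget, the choice of which components to mask, and the toy numbers are all assumptions made for illustration.

```python
def mask_rewards_by_length(rewards, lengths, max_tokens, masked_names=("coverage",)):
    """Zero the named reward components for captions longer than max_tokens.

    rewards maps a reward name to per-caption scores; lengths holds the token
    count of each caption. Both the budget and the masked set are hypothetical.
    """
    over_budget = [n > max_tokens for n in lengths]
    result = {}
    for name, scores in rewards.items():
        if name in masked_names:
            # Masked component: captions over the budget receive no credit.
            result[name] = [0.0 if over else s for s, over in zip(scores, over_budget)]
        else:
            # Unmasked component: copied through unchanged.
            result[name] = list(scores)
    return result


toy_rewards = {
    "correctness": [0.8, 0.7, 0.9],
    "coverage": [0.5, 0.9, 0.6],
}
toy_lengths = [80, 310, 120]  # made-up token counts

masked = mask_rewards_by_length(toy_rewards, toy_lengths, max_tokens=256)
print(masked["coverage"])  # [0.5, 0.0, 0.6]
```

Only the second caption exceeds the assumed 256-token budget, so only its coverage score is masked; its correctness score is left untouched.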
Topics
- Image Captioning
- Reinforcement Learning
- Multimodal Large Language Models
- BalCapRL Framework
- Multi-objective Optimization
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Apple Machine Learning Research.