RubiCap: Rubric-Guided Reinforcement Learning for Dense Image Captioning
Summary
RubiCap is a reinforcement learning (RL) framework for dense image captioning that addresses the high cost of expert annotations and the limitations of supervised distillation. Open-ended captioning has no deterministic checker to supply RL rewards, so RubiCap derives fine-grained, sample-specific reward signals from LLM-written rubrics. The framework first assembles a diverse set of candidate captions, then uses an LLM rubric writer to identify strengths and weaknesses and convert those insights into explicit evaluation criteria. An LLM judge then scores captions against these criteria, providing structured, multi-faceted evaluations instead of coarse scalar rewards. RubiCap achieves the highest win rates on CapArena, surpassing supervised distillation, prior RL methods, human-expert annotations, and GPT-4V-augmented outputs. On CaptionQA word efficiency, its 7B model matches Qwen2.5-VL-32B-Instruct, and its 3B model outperforms the 7B Qwen2.5-VL counterpart. Notably, using RubiCap-3B as the captioner produces stronger pretrained VLMs than training on captions from proprietary models.
Key takeaway
For AI scientists developing vision-language models (VLMs), RubiCap offers a way to generate high-quality dense image captions without expensive human annotations. Consider integrating LLM-driven rubric generation into your RL pipelines to create more nuanced reward signals, especially for open-ended tasks where deterministic evaluation is difficult. Per the reported results, this approach can yield more efficient and performant VLMs, even with a small captioning model such as RubiCap-3B.
Key insights
RubiCap uses LLM-generated rubrics to provide fine-grained, structured reward signals for dense image captioning via reinforcement learning.
Principles
- LLMs can create explicit evaluation criteria.
- Structured rewards improve RL in open-ended tasks.
- Synthetic captions can outperform human annotations.
Method
RubiCap assembles candidate captions, uses an LLM rubric writer to diagnose the current policy's deficiencies, and converts those insights into evaluation criteria that an LLM judge applies to produce multi-faceted rewards.
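To make the rubric-to-reward step concrete, here is a minimal sketch of how per-criterion judge verdicts could be aggregated into a single reward. It is an illustration under stated assumptions, not the paper's implementation: the `Criterion` structure, the weighting scheme, and the `keyword_judge` stand-in (a simple substring check used in place of a real LLM judge) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Criterion:
    """One rubric item written for a specific image (hypothetical structure)."""
    description: str
    weight: float = 1.0


# A judge takes (caption, criterion) and returns True if the criterion is met.
Judge = Callable[[str, Criterion], bool]


def rubric_reward(caption: str, rubric: List[Criterion], judge: Judge) -> float:
    """Weighted fraction of rubric criteria the caption satisfies, in [0, 1]."""
    total = sum(c.weight for c in rubric)
    if total == 0:
        return 0.0
    met = sum(c.weight for c in rubric if judge(caption, c))
    return met / total


def keyword_judge(caption: str, criterion: Criterion) -> bool:
    """Stand-in for an LLM judge: a case-insensitive substring check."""
    return criterion.description.lower() in caption.lower()


# Hypothetical rubric for one image; a real rubric would come from the LLM rubric writer.
rubric = [
    Criterion("red umbrella", weight=2.0),
    Criterion("wet pavement", weight=1.0),
    Criterion("street lamp", weight=1.0),
]
caption = "A woman holds a red umbrella while crossing wet pavement."
reward = rubric_reward(caption, rubric, keyword_judge)
print(reward)
```

Keeping the per-criterion verdicts separate until the final aggregation is what makes the signal structured: the same verdicts could be logged or weighted differently without re-querying the judge.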
In practice
- Use LLMs for complex reward signal generation.
- Apply structured evaluation in RL for creative tasks.
- Consider RubiCap-3B as a captioner for generating VLM pretraining data.
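One way rubric-based rewards might plug into an RL update is to compare the candidate captions sampled for the same image against each other. The sketch below normalizes rewards within such a group to obtain advantages. This group-relative normalization is an assumption chosen for illustration; the summary above does not state which RL algorithm RubiCap uses, and the function name and `eps` parameter are hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Center and scale rewards within one group of captions for the same image.

    Captions scoring above the group mean get positive advantages, those
    below get negative ones. eps avoids division by zero when all rewards tie.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rubric rewards for four candidate captions of one image.
rewards = [0.75, 0.25, 0.5, 0.5]
advantages = group_advantages(rewards)
print(advantages)
```

When every candidate receives the same reward, all advantages are zero, so that group contributes no learning signal. This is one reason fine-grained rubric scores can be more useful than coarse scalar rewards that often tie.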
Topics
- Dense Image Captioning
- Reinforcement Learning
- LLM-based Evaluation
- Vision-Language Models
- Synthetic Captioning
Best for: AI Scientist, Research Scientist, AI Researcher, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Apple Machine Learning Research.