Analyzing and Improving Fine-grained Preference Optimization in Medical LVLMs
Summary
A new method, Fine-grained Regularized Medical Preference Optimization (FiRe-MPO), addresses critical limitations in applying Large Vision-Language Models (LVLMs) to medical imaging tasks. Existing post-training alignment techniques like Direct Preference Optimization (DPO) struggle with coarse sequence-level reward signals, off-policy distribution shifts from static supervised fine-tuning references, and insufficient visual grounding. FiRe-MPO introduces a bidirectional token-wise KL regularizer to refine clinically important text spans and a visual-contrastive grounding objective that pairs clean and lesion-corrupted images to penalize ungrounded responses. It also constructs on-policy preference pairs by minimally editing model-generated outputs, correcting only clinically erroneous parts while preserving style. This framework achieved an average relative improvement of 10.24% over DPO and RRPO across two state-of-the-art LVLMs, enhancing clinical correctness and visual sensitivity.
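To make "coarse sequence-level reward signals" concrete, here is a minimal sketch of the standard DPO loss for a single preference pair. It is the usual DPO formulation, not code from the FiRe-MPO paper; the function names and the default `beta` are illustrative assumptions.

```python
import math


def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Each argument is a list of per-token log-probabilities of a response
    under the policy or under the frozen reference model.
    beta is an illustrative default, not a value from the paper.
    """
    # Per-token values are summed before anything else happens, so only
    # sequence totals enter the loss.
    chosen_margin = sum(policy_chosen) - sum(ref_chosen)
    rejected_margin = sum(policy_rejected) - sum(ref_rejected)
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))


if __name__ == "__main__":
    # When the policy equals the reference, both margins are zero
    # and the loss is log(2), about 0.693.
    print(dpo_loss([-1.0, -2.0], [-1.0, -3.0], [-1.0, -2.0], [-1.0, -3.0]))
```

Because the per-token log-probabilities are summed first, the loss cannot tell a token that carries a clinical error from one that only carries style, which is the gap the token-wise regularizer is meant to close.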
Key takeaway
AI scientists and machine learning engineers developing medical LVLMs should prioritize fine-grained alignment techniques that address both textual and visual grounding. Standard DPO or RRPO pipelines may be susceptible to stylistic reward hacking and may lack sensitivity to subtle clinical features. Consider building on-policy preference pairs from minimally edited model outputs, then adding token-wise regularization and a visual-contrastive objective, as FiRe-MPO does. The reported results suggest this improves clinical correctness and visual sensitivity, both of which matter for safe deployment.
Key insights
FiRe-MPO refines medical LVLM alignment by integrating fine-grained token-level and visual-contrastive preference optimization.
Principles
- Medical LVLM alignment needs fine-grained token rewards.
- On-policy preference pairs that preserve style reduce stylistic reward hacking.
- Explicit visual grounding improves diagnostic accuracy.
Method
FiRe-MPO uses a bidirectional token-wise KL regularizer and a visual-contrastive grounding objective. It constructs preference pairs by minimally editing model-generated outputs to correct clinical errors.
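The summary does not give the exact form of either objective, so the following is only a rough sketch of what such terms could look like. It assumes the regularizer sums forward and reverse KL between two per-token distributions and weights clinically important positions more heavily, and that the grounding term is a logistic loss on the gap between a response's log-probability given the clean image and given the lesion-corrupted image. All function names, the weighting scheme, and `beta` are hypothetical.

```python
import math


def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def kl(p, q):
    # KL(p || q) for two categorical distributions over the same vocabulary.
    # Assumes q is strictly positive wherever p is (true for softmax outputs).
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def bidirectional_token_kl(dists_a, dists_b, span_weights):
    """Weighted mean over token positions of KL(a||b) + KL(b||a).

    dists_a, dists_b: one next-token distribution per position.
    span_weights: hypothetical per-token weights, e.g. 1.0 inside a
    clinically important span and 0.0 elsewhere.
    """
    total = 0.0
    for p, q, w in zip(dists_a, dists_b, span_weights):
        total += w * (kl(p, q) + kl(q, p))
    return total / max(sum(span_weights), 1e-12)


def visual_contrastive_loss(logp_clean, logp_corrupted, beta=0.1):
    """Assumed grounding term for one response.

    Small when the response is much more likely given the clean image than
    given the lesion-corrupted one; a response whose likelihood ignores the
    image sits at log(2).
    """
    return -log_sigmoid(beta * (logp_clean - logp_corrupted))


if __name__ == "__main__":
    p = [0.7, 0.2, 0.1]
    q = [0.1, 0.2, 0.7]
    # Only the second position is weighted, so the first is ignored.
    print(bidirectional_token_kl([p, p], [p, q], [0.0, 1.0]))
    print(visual_contrastive_loss(-2.0, -8.0))
```

In a real training loop these terms would be computed on model logits in an autodiff framework and added to the preference loss with tuned coefficients; the pure-Python version here only illustrates the arithmetic.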
In practice
- Generate preference data by minimally editing model outputs.
- Incorporate visual-contrastive training for grounding.
- Apply token-level regularization for clinical precision.
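The first step above can be illustrated with a small sketch. It assumes the corrected text is supplied by an annotator or an automated checker (the summary does not say how edits are produced), splits on whitespace instead of using a model tokenizer, and uses a hypothetical helper name. The returned masks mark the edited tokens, which is one possible way to tell a token-level regularizer where the clinical content changed.

```python
from difflib import SequenceMatcher


def build_preference_pair(model_output, corrected_output):
    """Build one on-policy pair from a model output and its minimal edit.

    Returns (chosen, rejected, chosen_mask, rejected_mask), where each mask
    holds 1 for tokens that differ between the two versions and 0 otherwise.
    Whitespace splitting stands in for a real tokenizer.
    """
    rejected = model_output.split()
    chosen = corrected_output.split()
    chosen_mask = [0] * len(chosen)
    rejected_mask = [0] * len(rejected)

    matcher = SequenceMatcher(a=rejected, b=chosen, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        # Mark every token touched by a replace, delete, or insert.
        for i in range(i1, i2):
            rejected_mask[i] = 1
        for j in range(j1, j2):
            chosen_mask[j] = 1
    return chosen, rejected, chosen_mask, rejected_mask


if __name__ == "__main__":
    # Hypothetical report text, for illustration only.
    generated = "Chest X-ray shows a right lower lobe opacity consistent with pneumonia."
    corrected = "Chest X-ray shows a left lower lobe opacity consistent with pneumonia."
    chosen, rejected, chosen_mask, rejected_mask = build_preference_pair(generated, corrected)
    print(chosen_mask)  # [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
```

Because the chosen and rejected responses share everything except the corrected span, the pair differs in clinical content and not in style, which is the property the summary credits with limiting stylistic reward hacking.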
Topics
- Medical LVLMs
- FiRe-MPO
- Preference Optimization
- Visual Grounding
- Clinical Alignment
- Direct Preference Optimization
Best for: AI Scientist, Machine Learning Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.