Analyzing and Improving Fine-grained Preference Optimization in Medical LVLMs
Summary
Large Vision-Language Models (LVLMs) perform strongly in medical imaging but struggle with factual inconsistencies, poor visual grounding, and misalignment with clinically meaningful feedback. Existing post-training alignment methods, including Direct Preference Optimization (DPO) variants, have three limitations: they use sequence-level reward signals, they rely on a static supervised fine-tuning reference (which causes off-policy distribution shift), and they lack explicit visual grounding constraints. A new fine-grained, on-policy alignment framework addresses these issues by combining a bidirectional token-wise KL regularizer with a visual-contrastive grounding objective. The grounding objective pairs clean and lesion-corrupted images to penalize responses generated without adequate visual evidence. The framework constructs preference pairs by minimally editing model-generated outputs, correcting only the clinically erroneous spans while preserving linguistic style. Experiments across medical imaging tasks and clinical text generation benchmarks are reported to validate its effectiveness.
Key takeaway
For AI scientists developing medical LVLMs, existing DPO methods fall short because their reward signals are coarse and they rely on static references, which leads to clinical inaccuracies. Consider fine-grained, on-policy alignment frameworks that incorporate token-wise KL regularization and visual-contrastive grounding to improve factual consistency and diagnostic precision in medical applications. This approach reduces off-policy distribution shift and improves clinical relevance.
Key insights
Fine-grained preference optimization improves medical LVLM accuracy by addressing token-level errors and visual grounding.
Principles
- Token-wise reward signals are crucial for clinical accuracy.
- On-policy alignment avoids distribution shifts from static references.
- Explicit visual grounding prevents hallucination in medical LVLMs.
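To illustrate why a sequence-level signal is coarse, here is a minimal sketch of the standard DPO loss computed from per-token log-probabilities. The function name `dpo_loss`, the toy inputs, and `beta=0.1` are assumptions for illustration and are not taken from the original article.

```python
import math

def dpo_loss(pol_chosen, ref_chosen, pol_rejected, ref_rejected, beta=0.1):
    """Standard DPO loss from per-token log-probs of two responses.

    Each argument is a list of per-token log-probabilities under the
    policy (pol_*) or the frozen reference model (ref_*).
    Returns the loss and the per-token log-ratios for inspection.
    """
    ratio_c = [p - r for p, r in zip(pol_chosen, ref_chosen)]
    ratio_r = [p - r for p, r in zip(pol_rejected, ref_rejected)]
    # Only the two sums enter the loss: this is the sequence-level signal.
    margin = beta * (sum(ratio_c) - sum(ratio_r))
    loss = math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))
    return loss, ratio_c, ratio_r

# Toy example: the rejected response differs from the chosen one at one token.
loss, ratio_c, ratio_r = dpo_loss(
    pol_chosen=[-0.2, -0.3, -0.1],
    ref_chosen=[-0.2, -0.4, -0.1],
    pol_rejected=[-0.2, -1.5, -0.1],
    ref_rejected=[-0.2, -1.0, -0.1],
)
print(loss, ratio_c, ratio_r)
```

Because only the summed log-ratios reach the loss, a single clinically wrong token contributes one term among many. That dilution is what token-wise signals are meant to avoid.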
Method
Combines a bidirectional token-wise KL regularizer with a visual-contrastive grounding objective using clean and lesion-corrupted images to penalize ungrounded responses.
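This summary does not give the exact losses, so the following is only a sketch of what the two named components could look like: a symmetric KL term averaged over token positions, and a hinge penalty that asks a response to be more likely given the clean image than given the lesion-corrupted one. The function names, the hinge form, and the `margin` value are assumptions, not the authors' formulation.

```python
import math

def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def bidirectional_token_kl(policy_dists, ref_dists):
    """Mean over token positions of KL(policy || ref) + KL(ref || policy).

    Each argument is a list with one next-token distribution per position.
    """
    per_token = [kl(p, r) + kl(r, p) for p, r in zip(policy_dists, ref_dists)]
    return sum(per_token) / len(per_token)

def visual_contrastive_penalty(logp_clean, logp_corrupt, margin=1.0):
    """Hinge penalty on the log-likelihood gap between the two image views.

    Zero once the response is at least `margin` nats more likely given the
    clean image than given the lesion-corrupted image (assumed form).
    """
    return max(0.0, margin - (logp_clean - logp_corrupt))

# Toy example with a 2-word vocabulary and 2 token positions.
policy = [[0.7, 0.3], [0.5, 0.5]]
reference = [[0.4, 0.6], [0.5, 0.5]]
print(bidirectional_token_kl(policy, reference))
print(visual_contrastive_penalty(logp_clean=-2.0, logp_corrupt=-2.1))
```

In this sketch, a response whose likelihood barely changes when the lesion is corrupted receives a large penalty, which signals that it was not relying on the visual evidence.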
In practice
- Use token-level feedback for critical medical text generation.
- Generate preference pairs by minimally editing model outputs.
- Incorporate visual-contrastive training for medical image analysis.
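The minimal-editing step above can be sketched as follows. The helper `build_preference_pair`, the example report sentence, and the idea that corrections arrive as (wrong span, fixed span) pairs are all hypothetical; the summary does not say how the erroneous spans are identified.

```python
from difflib import SequenceMatcher

def build_preference_pair(model_output, corrections):
    """Build a (chosen, rejected) pair by editing only the erroneous spans.

    model_output: text sampled from the current policy (the rejected side).
    corrections:  list of (wrong_span, fixed_span) string pairs.
    """
    chosen = model_output
    for wrong, fixed in corrections:
        chosen = chosen.replace(wrong, fixed, 1)  # edit first occurrence only

    # Diff at the token level to record exactly which spans changed.
    rejected_tokens = model_output.split()
    chosen_tokens = chosen.split()
    matcher = SequenceMatcher(a=rejected_tokens, b=chosen_tokens, autojunk=False)
    edited_spans = [
        (tag, rejected_tokens[i1:i2], chosen_tokens[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
    return {"chosen": chosen, "rejected": model_output, "edited_spans": edited_spans}

pair = build_preference_pair(
    "There is a large pleural effusion in the left lung.",
    [("large", "small")],
)
print(pair["chosen"])
print(pair["edited_spans"])
```

Since the chosen response starts from the model's own output, the two sides of the pair share wording and style and differ only in the corrected span.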
Topics
- Medical LVLMs
- Preference Optimization
- Visual Grounding
- Clinical Text Generation
- Factual Consistency
- Large Vision-Language Models
Best for: AI Scientist, Machine Learning Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.