Analyzing and Improving Fine-grained Preference Optimization in Medical LVLMs

2026-06-12 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Health & Medical Research · Depth: Expert, medium

Summary

Fine-grained Regularized Medical Preference Optimization (FiRe-MPO) is a new method that targets three limitations of post-training alignment for Large Vision-Language Models (LVLMs) in medical imaging. Existing techniques such as Direct Preference Optimization (DPO) rely on coarse sequence-level reward signals, suffer off-policy distribution shift from static supervised fine-tuning references, and provide insufficient visual grounding. FiRe-MPO adds a bidirectional token-wise KL regularizer that refines clinically important text spans, and a visual-contrastive grounding objective that pairs clean and lesion-corrupted images to penalize ungrounded responses. It also constructs on-policy preference pairs by minimally editing model-generated outputs, correcting only the clinically erroneous parts while preserving style. The authors report an average relative improvement of 10.24% over DPO and RRPO across two state-of-the-art LVLMs, with gains in clinical correctness and visual sensitivity.

Key takeaway

If you are an AI scientist or machine learning engineer developing medical LVLMs, prioritize fine-grained alignment techniques that address both textual and visual grounding. Existing DPO or RRPO implementations may be susceptible to stylistic reward hacking and may lack sensitivity to subtle clinical features. Consider combining token-wise regularization and a visual-contrastive objective, as in FiRe-MPO, with on-policy preference pairs built by minimally editing model outputs. The reported results suggest this can improve clinical accuracy and reduce factual inconsistencies, both of which matter for safe deployment.

Key insights

FiRe-MPO refines medical LVLM alignment by integrating fine-grained token-level and visual-contrastive preference optimization.

Principles

Medical LVLM alignment needs fine-grained token rewards.
On-policy preference pairs prevent stylistic reward hacking.
Explicit visual grounding improves diagnostic accuracy.
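
To see why a sequence-level signal is coarse, consider the standard DPO loss for a single preference pair, which depends only on summed sequence log-probabilities. The sketch below is illustrative and not taken from the paper: the function name dpo_loss, the beta value, and the toy token log-probabilities are assumptions chosen for the example.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one pair, computed from summed sequence log-probs."""
    # Implicit reward margin: how much more the policy prefers the chosen
    # response over the rejected one, relative to the reference model.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    z = beta * margin
    # -log(sigmoid(z)), written as a numerically stable softplus(-z).
    return math.log1p(math.exp(-z)) if z >= 0 else -z + math.log1p(math.exp(z))

# Toy per-token log-probs (made-up values) for two rejected responses.
# In A the unlikely token is the last one; in B it is the first one.
rejected_a = [-0.5, -0.25, -2.0]
rejected_b = [-2.0, -0.25, -0.5]

loss_a = dpo_loss(-1.0, sum(rejected_a), -1.5, -2.0)
loss_b = dpo_loss(-1.0, sum(rejected_b), -1.5, -2.0)

# Both responses collapse to the same sequence score, so the loss cannot
# tell which token carried the error.
print(loss_a == loss_b)
```

Because the two rejected responses sum to the same score, they receive identical losses regardless of where the unlikely token sits. A token-level signal is what lets training target the specific clinically wrong span.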

Method

FiRe-MPO uses a bidirectional token-wise KL regularizer and a visual-contrastive grounding objective. It constructs preference pairs by minimally editing model-generated outputs to correct clinical errors.
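
The summary does not give the exact formulas, so the following is only one plausible reading of the two objectives, written as a minimal sketch. It assumes the bidirectional regularizer is a symmetric KL between policy and reference next-token distributions, applied only at positions flagged as clinically important, and that the grounding term is a logistic loss on the gap between the response log-probability given the clean image and given the lesion-corrupted image. All names (bidirectional_token_kl, visual_contrastive_loss, span_mask, beta) and all numbers are hypothetical.

```python
import math

def _softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return x + math.log1p(math.exp(-x)) if x > 0 else math.log1p(math.exp(x))

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete next-token distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def bidirectional_token_kl(policy_dists, ref_dists, span_mask):
    """Mean of KL(p||q) + KL(q||p) over token positions where span_mask is 1."""
    total, count = 0.0, 0
    for p, q, keep in zip(policy_dists, ref_dists, span_mask):
        if keep:
            total += kl_divergence(p, q) + kl_divergence(q, p)
            count += 1
    return total / count if count else 0.0

def visual_contrastive_loss(logp_clean, logp_corrupted, beta=1.0):
    """Small when the response is much more likely given the clean image
    than given the lesion-corrupted one; large when the image barely matters."""
    return _softplus(-beta * (logp_clean - logp_corrupted))

# Toy three-token response over a three-word vocabulary (made-up numbers).
policy = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]
ref = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.3, 0.3, 0.4]]
span_mask = [0, 1, 0]  # assume only the middle token is clinically important

print(bidirectional_token_kl(policy, ref, span_mask))
print(visual_contrastive_loss(logp_clean=-1.0, logp_corrupted=-5.0))
```

In this reading, a response whose likelihood does not change when the lesion is corrupted gets the larger grounding loss, which matches the stated goal of penalizing ungrounded responses.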

In practice

Generate preference data by minimally editing model outputs.
Incorporate visual-contrastive training for grounding.
Apply token-level regularization for clinical precision.
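
The first step above, building preference pairs by minimal editing, can be sketched with the Python standard library. This is an assumption-laden illustration, not the paper's pipeline: it presumes a corrected version of the model output already exists (for example, from a reviewer), uses whitespace tokenization, and invents the function name build_preference_pair and the example report text.

```python
import difflib

def build_preference_pair(model_output, corrected_output):
    """Pair a model output (rejected) with its minimally edited correction
    (chosen) and mark which tokens differ between the two."""
    rejected = model_output.split()
    chosen = corrected_output.split()
    matcher = difflib.SequenceMatcher(None, rejected, chosen, autojunk=False)
    rejected_mask = [0] * len(rejected)
    chosen_mask = [0] * len(chosen)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # shared tokens keep the model's own style
        for i in range(i1, i2):
            rejected_mask[i] = 1  # tokens that were removed or replaced
        for j in range(j1, j2):
            chosen_mask[j] = 1  # tokens that were inserted as the correction
    return {
        "chosen": chosen,
        "rejected": rejected,
        "chosen_mask": chosen_mask,
        "rejected_mask": rejected_mask,
    }

# Hypothetical report sentence and correction, invented for illustration.
pair = build_preference_pair(
    "No pleural effusion is seen and heart size is normal",
    "Small left pleural effusion is seen and heart size is normal",
)
print(pair["rejected_mask"])
print(pair["chosen_mask"])
```

The masks mark only the edited span, so they could serve as the span selector for a token-level regularizer, while the untouched tokens stay identical across the pair and give the optimizer no stylistic difference to exploit.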

Topics

Medical LVLMs
FiRe-MPO
Preference Optimization
Visual Grounding
Clinical Alignment
Direct Preference Optimization

Best for: AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.