Reducing Hallucination in Vision-Language Models via Stage-wise Preference Optimization under Distribution Shift
Summary
A new stage-wise preference optimization framework reduces hallucination in vision-language models (VLMs) without architectural modifications. The approach, instantiated on a LLaMA-3 70B language decoder with a visual encoder, constructs hallucination-focused preference pairs near known failure boundaries: ambiguous spatial orientation, object relationships, OCR uncertainty, and adversarial false-premise queries. By generating negative responses that are minimally perturbed yet visually inconsistent, the framework helps Direct Preference Optimization (DPO) distinguish grounded reasoning from plausible hallucination. Experiments on open-source benchmarks and real-world multimodal evaluation scenarios show improved grounding consistency, reduced hallucination, and more informative grounded responses. In ambiguous spatial reasoning and adversarial false-premise settings, the model outperforms several frontier proprietary VLMs, such as Gemini Flash and GPT-series models.
Key takeaway
For research scientists and VLM developers aiming to improve model reliability, a stage-wise preference optimization framework can reduce hallucinations. Focus on constructing targeted preference data that highlights ambiguous reasoning boundaries and adversarial scenarios, rather than relying solely on larger generic datasets. This approach improves grounding consistency and response informativeness even with relatively small, high-quality preference datasets, offering a path to more robust multimodal AI systems.
Key insights
Stage-wise preference optimization with targeted data construction effectively reduces VLM hallucination by focusing on difficult grounding boundaries.
Principles
- Hallucination often arises from autoregressive models favoring linguistic plausibility over visual evidence.
- Data quality and preference structure are more critical than raw volume for alignment-oriented objectives.
- Pairwise evaluation offers a more reliable signal for hallucination assessment than pointwise metrics.
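To make the pairwise-evaluation principle concrete, here is a minimal sketch of how head-to-head judgments between two models could be aggregated into a win rate. The labels `"a"`, `"b"`, and `"tie"`, the function name, and the choice to count a tie as half a win are illustrative assumptions; the summary does not say how the original work aggregates its pairwise results.

```python
from collections import Counter

VALID_LABELS = {"a", "b", "tie"}  # hypothetical label set for this sketch


def pairwise_win_rate(judgments):
    """Return model A's win rate over model B, counting each tie as half a win.

    `judgments` holds one label per compared response pair: "a" if A's
    response was judged better grounded, "b" if B's was, "tie" otherwise.
    """
    unknown = set(judgments) - VALID_LABELS
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    if not judgments:
        raise ValueError("no judgments to aggregate")
    counts = Counter(judgments)
    return (counts["a"] + 0.5 * counts["tie"]) / len(judgments)
```

A win rate above 0.5 means model A was preferred more often than model B on the compared set.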
Method
The method trains in two stages. Supervised fine-tuning on large-scale multimodal data first establishes basic grounding. DPO then refines the model using hallucination-targeted preference pairs constructed via structured data augmentation.
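For readers unfamiliar with the second stage, the following is a minimal sketch of the standard per-pair DPO loss, written in plain Python so the arithmetic is visible. It assumes the summed log-probabilities of the chosen (grounded) and rejected (hallucinated) responses are already available from the trained policy and from a frozen reference model; the function name, argument names, and the default `beta=0.1` are illustrative and not taken from the paper.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair.

    loss = -log(sigmoid(beta * ((pc - rc) - (pr - rr))))
    where pc/pr are policy log-probs of the chosen/rejected response and
    rc/rr are the same quantities under the frozen reference model.
    """
    chosen_log_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_log_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_log_ratio - rejected_log_ratio)
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch for numerical stability.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss falls below log 2 once the policy favors the grounded response more strongly than the reference model does, and rises above it when the policy drifts toward the hallucinated one.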
In practice
- Construct preference pairs near decision boundaries for high-information training signals.
- Use visual prompting as an initial guard layer to encourage structured image examination.
- Incorporate adversarial false-premise training to penalize compliance with incorrect queries.
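The practices above can be illustrated with a small sketch of preference-pair construction. It is not the authors' augmentation pipeline, which the summary does not describe in detail. The `PreferencePair` record, the spatial-word swap table, and the question and answer templates are all hypothetical; they only show what a minimally perturbed negative and a false-premise pair might look like.

```python
import re
from dataclasses import dataclass

# Hypothetical swap table: flipping one spatial word yields a response that
# stays fluent but contradicts the image.
_SWAP = {"left": "right", "right": "left", "above": "below", "below": "above"}
_SPATIAL = re.compile(r"\b(left|right|above|below)\b")


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # grounded response
    rejected: str  # plausible but visually inconsistent response
    category: str


def perturb_spatial(text):
    """Flip every spatial word in `text`; return it unchanged if none occur."""
    return _SPATIAL.sub(lambda m: _SWAP[m.group(1)], text)


def spatial_pair(prompt, grounded_answer):
    """Build a near-boundary pair, or return None if nothing could be perturbed."""
    negative = perturb_spatial(grounded_answer)
    if negative == grounded_answer:
        return None
    return PreferencePair(prompt, grounded_answer, negative, "spatial")


def false_premise_pair(absent_object):
    """Build a pair that penalizes going along with a question about an absent object."""
    prompt = f"What color is the {absent_object} in the image?"
    chosen = f"I don't see a {absent_object} in the image, so I can't describe its color."
    rejected = f"The {absent_object} is red."  # complies with the false premise
    return PreferencePair(prompt, chosen, rejected, "false_premise")
```

For example, `spatial_pair("Where is the cup?", "The cup is to the left of the laptop.")` yields a rejected response that differs from the grounded one by a single word, which is the kind of high-information contrast the first bullet describes.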
Topics
- Vision-Language Models
- Hallucination Reduction
- Stage-wise Preference Optimization
- Direct Preference Optimization
- Multimodal Data Augmentation
Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.