Reducing Hallucination in Vision-Language Models via Stage-wise Preference Optimization under Distribution Shift

2026-05-19 · Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

A new stage-wise preference optimization framework reduces hallucination in vision-language models (VLMs) without architectural modifications. Instantiated on a LLaMA-3 70B language decoder paired with a visual encoder, the approach constructs hallucination-focused preference pairs near known failure boundaries: ambiguous spatial orientation, object relationships, and OCR uncertainty. It also adds adversarial false-premise training. Because the negative responses are minimally perturbed yet visually inconsistent, Direct Preference Optimization (DPO) can better distinguish grounded reasoning from plausible hallucination. Experiments on open-source benchmarks and real-world multimodal evaluation scenarios show improved grounding consistency, reduced hallucination, and more informative grounded responses. In ambiguous spatial reasoning and adversarial false-premise settings, the framework is reported to outperform several frontier proprietary VLMs, including Gemini Flash and GPT-series models.

Key takeaway

For research scientists and VLM developers aiming to improve model reliability, a stage-wise preference optimization framework can reduce hallucinations. Focus on constructing targeted preference data that highlights ambiguous reasoning boundaries and adversarial scenarios, rather than relying solely on larger generic datasets. This approach improves grounding consistency and response informativeness even with relatively small, high-quality preference datasets, offering a path to more robust multimodal AI systems.

Key insights

Stage-wise preference optimization with targeted data construction effectively reduces VLM hallucination by focusing on difficult grounding boundaries.

Principles

Hallucination often arises from autoregressive models favoring linguistic plausibility over visual evidence.
Data quality and preference structure are more critical than raw volume for alignment-oriented objectives.
Pairwise evaluation offers a more reliable signal for hallucination assessment than pointwise metrics.
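
As a rough illustration of the pairwise principle, the sketch below aggregates per-sample judge verdicts into a win rate. The function name `pairwise_win_rate`, the verdict labels, and the half-credit treatment of ties are assumptions made for illustration; the source does not specify the paper's evaluation protocol.

```python
from collections import Counter


def pairwise_win_rate(verdicts):
    """Win rate of model A over model B from per-sample judge verdicts.

    verdicts: iterable of "A", "B", or "tie", where each verdict names the
    response judged better grounded in the image. Ties count as half a win.
    Any other label is ignored.
    """
    counts = Counter(verdicts)
    total = counts["A"] + counts["B"] + counts["tie"]
    if total == 0:
        raise ValueError("no verdicts to aggregate")
    return (counts["A"] + 0.5 * counts["tie"]) / total


# Hypothetical verdicts for four image-question samples.
rate = pairwise_win_rate(["A", "A", "tie", "B"])
```

A pointwise metric would score each response in isolation. The pairwise form only asks which of two responses is better grounded, which is the comparison the principle above favors.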

Method

The method involves a two-stage training process: first, supervised fine-tuning on large-scale multimodal data for basic grounding, then DPO refinement using hallucination-targeted preference pairs constructed via structured data augmentation.
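
The DPO refinement stage can be sketched with the standard DPO objective for a single preference pair. This is a minimal illustration in plain Python, not the paper's implementation: the function name `dpo_loss`, the default `beta = 0.1`, and the log-probability values are assumptions chosen for illustration, and the frozen reference model is assumed to be the stage-one fine-tuned checkpoint.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Each argument is the summed token log-probability of a response given
    the image and prompt, under the trainable policy or the frozen
    reference model. beta=0.1 is an illustrative default.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) written as softplus(-margin) for numerical stability.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Hypothetical log-probabilities, for illustration only.
no_preference = dpo_loss(-12.0, -12.5, -12.0, -12.5)     # policy equals reference
prefers_grounded = dpo_loss(-10.0, -15.0, -12.0, -12.5)  # policy favors the grounded answer
```

When the policy matches the reference, the loss equals log 2. It falls as the policy raises the grounded response's likelihood relative to the hallucinated one, which is the pressure that hallucination-targeted pairs are meant to apply.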

In practice

Construct preference pairs near decision boundaries for high-information training signals.
Use visual prompting as an initial guard layer to encourage structured image examination.
Incorporate adversarial false-premise training to penalize compliance with incorrect queries.
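
The first and third points above can be illustrated with a small sketch of how hallucination-targeted preference pairs might be assembled. Everything here is hypothetical: the `PreferencePair` record, the `SPATIAL_SWAPS` table, and both helper functions are illustrative names, and the source does not describe the paper's actual augmentation pipeline at this level of detail.

```python
import re
from dataclasses import dataclass


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # grounded response
    rejected: str  # plausible but visually inconsistent response


# Hypothetical swap table: each spatial term maps to its opposite.
SPATIAL_SWAPS = {"left": "right", "right": "left",
                 "above": "below", "below": "above"}

_SPATIAL_PATTERN = re.compile(r"\b(" + "|".join(SPATIAL_SWAPS) + r")\b")


def spatial_negative(prompt, grounded):
    """Build a pair whose rejected answer differs by one flipped spatial term."""
    match = _SPATIAL_PATTERN.search(grounded)
    if match is None:
        return None  # nothing to perturb; skip this sample
    flipped = (grounded[:match.start()]
               + SPATIAL_SWAPS[match.group(1)]
               + grounded[match.end():])
    return PreferencePair(prompt, chosen=grounded, rejected=flipped)


def false_premise_pair(prompt, correction, compliant):
    """Build a pair that prefers correcting a false premise over complying with it."""
    return PreferencePair(prompt, chosen=correction, rejected=compliant)


# Hypothetical sample, for illustration only.
pair = spatial_negative(
    "Where is the mug relative to the laptop?",
    "The mug is to the left of the laptop.",
)
```

Flipping a single term keeps the rejected response fluent and close to the chosen one, so the only signal separating the two is agreement with the image.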

Topics

Vision-Language Models
Hallucination Reduction
Stage-wise Preference Optimization
Direct Preference Optimization
Multimodal Data Augmentation

Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.