Reducing Hallucination in Vision-Language Models via Stage-wise Preference Optimization under Distribution Shift

· Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

A new stage-wise preference optimization framework reduces hallucination in vision-language models (VLMs) without architectural modifications. The approach, instantiated on a LLaMA-3 70B language decoder with a visual encoder, constructs hallucination-focused preference pairs near known failure boundaries: ambiguous spatial orientation, object relationships, OCR uncertainty, and adversarial false-premise scenarios. Each negative response is minimally perturbed from a grounded one yet visually inconsistent with the image, which gives Direct Preference Optimization (DPO) a sharper signal for separating grounded reasoning from plausible hallucination. In experiments on open-source benchmarks and real-world multimodal evaluation scenarios, the authors report improved grounding consistency, reduced hallucination, and more informative grounded responses. They also report that the model outperforms several frontier proprietary VLMs, including Gemini Flash and GPT-series models, in ambiguous spatial reasoning and adversarial false-premise settings.
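
To make the idea of a minimally perturbed, visually inconsistent negative concrete, here is a small Python sketch. It is an illustration under stated assumptions, not the paper's pipeline: the swap table `SPATIAL_SWAPS`, the `PreferencePair` container, and the functions `perturb_spatial` and `make_pair` are all hypothetical names, and the summary does not say how the authors actually generate their negatives. The sketch flips one spatial term in a grounded answer so that the rejected response differs from the chosen one by a single word.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical swap table (an assumption for illustration only).
SPATIAL_SWAPS = {
    "left": "right",
    "right": "left",
    "above": "below",
    "below": "above",
}

_SPATIAL_PATTERN = re.compile(
    r"\b(" + "|".join(SPATIAL_SWAPS) + r")\b", re.IGNORECASE
)


@dataclass(frozen=True)
class PreferencePair:
    prompt: str    # question asked about the image
    chosen: str    # grounded answer (assumed to be verified against the image)
    rejected: str  # near-identical answer that contradicts the image


def perturb_spatial(text: str) -> Optional[str]:
    """Flip the first spatial term in `text`; return None if there is none."""
    match = _SPATIAL_PATTERN.search(text)
    if match is None:
        return None
    word = match.group(0)
    swapped = SPATIAL_SWAPS[word.lower()]
    if word[0].isupper():
        swapped = swapped.capitalize()
    return text[:match.start()] + swapped + text[match.end():]


def make_pair(prompt: str, grounded_answer: str) -> Optional[PreferencePair]:
    """Build a preference pair, or None when no spatial term can be flipped."""
    rejected = perturb_spatial(grounded_answer)
    if rejected is None:
        return None
    return PreferencePair(prompt, grounded_answer, rejected)


pair = make_pair("Where is the cup?", "The cup is to the left of the laptop.")
print(pair.rejected)  # The cup is to the right of the laptop.
```

A word-level swap like this is deliberately naive. It would also flip "left" used as a verb, for example, so a real pipeline would need a check that the perturbed answer really contradicts the image.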

Key takeaway

For research scientists and VLM developers who want more reliable models, a stage-wise preference optimization framework offers a way to reduce hallucinations without changing the architecture. Focus on constructing targeted preference data that highlights ambiguous reasoning boundaries and adversarial scenarios, rather than relying solely on larger generic datasets. According to the authors, this improves grounding consistency and response informativeness even with relatively small, high-quality preference datasets, offering a path to more robust multimodal AI systems.

Key insights

Stage-wise preference optimization with targeted data construction effectively reduces VLM hallucination by focusing on difficult grounding boundaries.

Method

The method involves a two-stage training process: first, supervised fine-tuning on large-scale multimodal data for basic grounding, then DPO refinement using hallucination-targeted preference pairs constructed via structured data augmentation.
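
The DPO refinement stage can be illustrated with the standard DPO objective for a single preference pair. The sketch below is a minimal pure-Python version under assumptions: the function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative choices, and the summary does not give the paper's hyperparameters or training code. Each argument is the summed log-probability of a full response given the image and prompt, under either the policy being trained or the frozen reference model from the supervised stage.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; not taken from the source
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    loss = -log(sigmoid(beta * margin)), where margin is the difference
    between the policy-vs-reference log-ratios of the chosen and the
    rejected response.
    """
    chosen_log_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_log_ratio = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_log_ratio - rejected_log_ratio)

    # Numerically stable form of -log(sigmoid(z)) = log(1 + exp(-z)).
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# With identical policy and reference the margin is zero, so the loss is log(2).
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))

# The loss drops when the policy favors the grounded response more than the reference does.
print(dpo_loss(-10.0, -18.0, -12.0, -15.0))
```

Because the loss depends only on how the policy's preference differs from the reference's, pairs whose rejected response is a near copy of the chosen one concentrate the gradient on the few tokens that carry the visual inconsistency.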

Topics

Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.