When Prompts Override Vision: Prompt-Induced Hallucinations in LVLMs
Summary
HalluScope is a new benchmark for investigating prompt-induced hallucinations in large vision-language models (LVLMs). The research finds that LVLM hallucinations arise primarily from over-reliance on textual priors and background knowledge, particularly when that information is embedded in the textual instruction. To address this, the authors propose HalluVL-DPO, a fine-tuning framework based on preference optimization that uses a specially curated training dataset to steer LVLMs toward more visually grounded responses. The fine-tuned model reduces the targeted hallucination failures while maintaining or improving performance on existing hallucination benchmarks and visual capability assessments. The authors plan to release the benchmark, training dataset, and code publicly to support further research.
Key takeaway
For AI engineers and research scientists developing or deploying LVLMs, the critical point is that priors carried in the textual prompt can induce hallucinations. Consider integrating HalluVL-DPO or a similar preference optimization technique into your fine-tuning workflow to mitigate these prompt-induced hallucinations and produce more visually accurate, reliable outputs. Evaluate your models with benchmarks such as HalluScope to identify and address these failure modes specifically.
Key insights
LVLM hallucinations are largely driven by over-reliance on priors in the textual instruction, not only by the limits of the vision backbone.
Principles
- Textual priors can override visual input.
- Preference optimization improves visual grounding.
Method
HalluVL-DPO fine-tunes LVLMs using preference optimization on a curated dataset, guiding models to prefer visually grounded responses over hallucinated ones.
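The summary does not spell out the exact training objective, but the name suggests Direct Preference Optimization (DPO). As a rough illustration only, the sketch below computes the standard DPO loss for a single preference pair, where the "chosen" response is the visually grounded one and the "rejected" response is the hallucinated one. The function name, argument names, and the default `beta` are assumptions made for this example, and the authors' actual loss may differ.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair (illustrative sketch).

    Each *_logp argument is the summed log-probability of a full response
    given the image and prompt, under either the policy being fine-tuned
    or the frozen reference model. `beta` is an assumed default.
    """
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)), written in a numerically stable form.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# Hypothetical log-probabilities: the policy has shifted toward the
# grounded answer and away from the hallucinated one.
loss = dpo_loss(policy_chosen_logp=-5.0, policy_rejected_logp=-9.0,
                ref_chosen_logp=-6.0, ref_rejected_logp=-6.0)
```

The loss falls below log 2 once the policy favors the grounded response more strongly than the reference model does, which is the behavior preference optimization rewards.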
In practice
- Use HalluScope to evaluate prompt-induced hallucinations.
- Apply HalluVL-DPO for visually grounded LVLM responses.
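One simple way to check whether prompt wording drives hallucinations is to compare hallucination rates across prompt conditions. The sketch below assumes each evaluated response has already been labeled as hallucinated or not. The record fields `condition` and `hallucinated` and the condition names are hypothetical, and HalluScope's actual data format and metrics are not described in this summary.

```python
from collections import defaultdict


def hallucination_rate_by_condition(records):
    """Return the fraction of hallucinated responses per prompt condition.

    `records` is an iterable of dicts with two hypothetical fields:
      - "condition": a label for how the prompt was phrased
      - "hallucinated": True if the response was judged a hallucination
    """
    counts = defaultdict(lambda: [0, 0])  # condition -> [hallucinated, total]
    for record in records:
        entry = counts[record["condition"]]
        entry[0] += int(record["hallucinated"])
        entry[1] += 1
    return {cond: bad / total for cond, (bad, total) in counts.items()}


# Hypothetical labeled outputs from one model under two prompt styles.
example_records = [
    {"condition": "neutral_prompt", "hallucinated": False},
    {"condition": "neutral_prompt", "hallucinated": False},
    {"condition": "prior_laden_prompt", "hallucinated": True},
    {"condition": "prior_laden_prompt", "hallucinated": False},
]
rates = hallucination_rate_by_condition(example_records)
```

A large gap between conditions on the same images would point to the prompt, rather than the visual input, as the source of the failures.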
Topics
- Large Vision-Language Models
- Hallucinations
- HalluScope Benchmark
- Textual Priors
- Preference Optimization
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.