GEASS: Gated Evidence-Adaptive Selective Caption Trust for Vision-Language Models
Summary
GEASS (Gated Evidence-Aware Selective Steering) is a training-free module that mitigates object hallucination in Vision-Language Models (VLMs) by adaptively controlling the influence of self-generated captions. Previous approaches often treat captions as uniformly beneficial, but this work shows that naively embedding them can degrade VLM accuracy: Qwen2.5-VL-3B drops nearly 10 points on HallusionBench (from 61.19 to 51.31). The degradation stems from a "deep anchoring effect," in which captions reshape the model's reasoning, and an "asymmetric error structure," in which omissions are frequent but mild while fabrications are rare but highly damaging. GEASS addresses this by performing two forward passes per query and using a confidence gate, an information-gain weight, and a disagreement penalty to selectively fuse caption logits. It consistently improves over vanilla inference and contrastive decoding on benchmarks such as POPE and HallusionBench across models such as Qwen2.5-VL-3B and InternVL3-3.8B, with only two extra forward passes.
Key takeaway
AI scientists and machine learning engineers working on Vision-Language Models should critically evaluate how auxiliary text, such as self-generated captions, influences their models. Instead of embedding captions unconditionally, consider adaptive steering mechanisms like GEASS. This training-free approach mitigates object hallucination and improves accuracy on benchmarks such as HallusionBench, offering a practical way to improve VLM reliability without costly retraining or architectural changes.
Key insights
Naively using VLM-generated captions can degrade accuracy due to anchoring effects and asymmetric error types.
Principles
- Captions exert a "deep anchoring effect" on VLM reasoning and lexical choices.
- Caption errors are "structurally asymmetric": omissions are common but mild, fabrications are rare but highly impactful.
- A caption's usefulness is a per-query property, not a per-corpus one.
Method
GEASS performs dual-path inference, combining clean and caption-augmented logits. It uses a confidence gate, an information-gain weight, and a disagreement penalty to adaptively regulate caption influence at the logit level.
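The summary names the three components but not their formulas, so the following is only a minimal sketch of what logit-level gated fusion could look like, not the paper's actual method. Every definition here is an assumption made for illustration: the gate is taken as one minus the clean path's top probability, the information-gain weight as the normalized entropy reduction from the clean to the caption-augmented distribution, and the disagreement penalty as an exponential decay in the Jensen-Shannon divergence between the two. The function name `fuse_logits` and the parameter `penalty_strength` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(p):
    """Shannon entropy in nats, skipping zero-probability entries."""
    return -sum(q * math.log(q) for q in p if q > 0.0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two distributions (nats)."""
    mid = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0.0)

    return 0.5 * kl(p, mid) + 0.5 * kl(q, mid)


def fuse_logits(clean_logits, caption_logits, penalty_strength=5.0):
    """Blend caption-augmented logits into clean logits (illustrative only).

    All three terms below are assumed forms, not taken from the paper.
    Requires at least two vocabulary entries so that log(V) is nonzero.
    """
    p_clean = softmax(clean_logits)
    p_cap = softmax(caption_logits)

    # Assumed confidence gate: trust the caption less when the clean
    # path is already confident in its top token.
    gate = 1.0 - max(p_clean)

    # Assumed information-gain weight: how much the caption reduces
    # entropy, normalized by the maximum entropy log(V) and floored at 0.
    gain = max(0.0, entropy(p_clean) - entropy(p_cap)) / math.log(len(p_clean))

    # Assumed disagreement penalty: shrink the caption's influence as the
    # two distributions diverge.
    penalty = math.exp(-penalty_strength * js_divergence(p_clean, p_cap))

    alpha = gate * gain * penalty  # lies in [0, 1) by construction
    fused = [c + alpha * (k - c) for c, k in zip(clean_logits, caption_logits)]
    return fused, alpha


# Uncertain clean path; caption path sharpens the same top token.
fused, alpha = fuse_logits([0.1, 0.0, 0.0], [3.0, 0.0, 0.0])
```

With this form, a caption that merely echoes the clean distribution receives zero weight, and a caption that contradicts a confident clean prediction is almost entirely suppressed, which is one way to reflect the "rare but highly damaging" fabrication case described above.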
In practice
- GEASS is plug-and-play and requires no architectural modifications or retraining.
- It adds only two extra forward passes per query and is compatible with any VLM that exposes decoding logits.
Topics
- Vision-Language Models
- Object Hallucination
- Caption Steering
- Inference-time Mitigation
- Logit Fusion
- Qwen2.5-VL-3B
- HallusionBench
Best for: Research Scientist, Computer Vision Engineer, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.