GEASS: Gated Evidence-Adaptive Selective Caption Trust for Vision-Language Models

Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

GEASS (Gated Evidence-Aware Selective Steering) is a training-free module that mitigates object hallucination in Vision-Language Models (VLMs) by adaptively controlling how much influence self-generated captions have on the answer. Previous approaches often treat captions as uniformly beneficial, but this work shows that naively embedding them can degrade VLM accuracy: Qwen2.5-VL-3B drops nearly 10 points on HallusionBench (from 61.19 to 51.31). The degradation stems from two factors. The first is a "deep anchoring effect", in which the caption reshapes the model's reasoning. The second is an "asymmetric error structure", in which caption omissions are frequent but mild, while fabrications are rare but highly damaging. GEASS addresses this by running two forward passes per query, one clean and one caption-augmented, and selectively fusing the caption logits using a confidence gate, an information-gain weight, and a disagreement penalty. It consistently improves on both vanilla inference and contrastive decoding on benchmarks such as POPE and HallusionBench, across models such as Qwen2.5-VL-3B and InternVL3-3.8B, at the cost of only two forward passes per query.

Key takeaway

AI scientists and machine learning engineers working on Vision-Language Models should critically evaluate how auxiliary text, such as self-generated captions, influences their models. Instead of embedding captions unconditionally, consider adaptive steering mechanisms like GEASS. This training-free approach mitigates object hallucination and improves accuracy on benchmarks such as HallusionBench, offering a practical way to improve VLM reliability without costly retraining or architectural changes.

Key insights

Naively using VLM-generated captions can degrade accuracy due to anchoring effects and asymmetric error types.

Principles

Method

GEASS performs dual-path inference, combining clean and caption-augmented logits. It uses a confidence gate, an information-gain weight, and a disagreement penalty to adaptively regulate caption influence at the logit level.
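
The summary names the three components but not their formulas, so the following is only a minimal sketch of what logit-level gating could look like, not the paper's implementation. Every concrete choice here is an assumption made for illustration: the function name fuse_logits, the parameters conf_threshold and penalty_scale, a gate based on the clean path's top probability, an information-gain weight defined as relative entropy reduction, and a disagreement penalty defined as an exponential of the KL divergence between the two paths.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q), smoothed with eps and clamped at zero."""
    kl = sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)
    return max(0.0, kl)


def fuse_logits(clean_logits, caption_logits, conf_threshold=0.9, penalty_scale=1.0):
    """Hypothetical fusion of clean and caption-augmented logits.

    Returns (fused_logits, caption_weight). All formulas are assumptions.
    """
    p_clean = softmax(clean_logits)
    p_cap = softmax(caption_logits)

    # Confidence gate (assumed form): if the clean path is already
    # confident, ignore the caption path entirely.
    if max(p_clean) >= conf_threshold:
        return list(clean_logits), 0.0

    # Information-gain weight (assumed form): fraction of the clean path's
    # entropy that the caption path removes, which lies in [0, 1].
    h_clean = entropy(p_clean)
    gain = max(0.0, h_clean - entropy(p_cap))
    weight = gain / h_clean if h_clean > 0 else 0.0

    # Disagreement penalty (assumed form): shrink the weight as the two
    # distributions diverge, so a contradicting caption is trusted less.
    weight *= math.exp(-penalty_scale * kl_divergence(p_cap, p_clean))

    # Convex combination of the two logit vectors.
    fused = [(1 - weight) * c + weight * k for c, k in zip(clean_logits, caption_logits)]
    return fused, weight
```

In a real VLM the two inputs would be the next-token logits from the clean and caption-augmented forward passes. For example, fuse_logits([0.2, 0.0], [2.0, 0.0]) moves the first logit part of the way toward the caption path, while fuse_logits([5.0, 0.0], [0.0, 5.0]) returns the clean logits unchanged because the gate closes.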

Best for: Research Scientist, Computer Vision Engineer, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.