GEASS: Gated Evidence-Adaptive Selective Caption Trust for Vision-Language Models

2026-06-12 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

GEASS (Gated Evidence-Aware Selective Steering) is a training-free module that mitigates object hallucination in Vision-Language Models (VLMs) by adaptively controlling how much self-generated captions influence decoding. Prior approaches often treat captions as uniformly helpful, but this work shows that naively embedding them can degrade accuracy: Qwen2.5-VL-3B drops nearly 10 points on HallusionBench (from 61.19 to 51.31). The authors attribute the degradation to two factors. The first is a "deep anchoring effect", in which captions reshape the model's reasoning. The second is an "asymmetric error structure", in which omissions are frequent but mild while fabrications are rare but highly damaging. GEASS addresses this by performing two forward passes per query and using a confidence gate, an information-gain weight, and a disagreement penalty to selectively fuse caption logits. It consistently improves on vanilla inference and contrastive decoding on benchmarks such as POPE and HallusionBench, across models such as Qwen2.5-VL-3B and InternVL3-3.8B, at a reported overhead of only two extra forward passes.

Key takeaway

AI Scientists and Machine Learning Engineers working on Vision-Language Models should critically evaluate how auxiliary text, such as self-generated captions, influences their models. Instead of unconditionally embedding captions, consider adaptive steering mechanisms like GEASS. This training-free approach mitigates object hallucination and improves accuracy on benchmarks such as HallusionBench, offering a practical way to improve VLM reliability without costly retraining or architectural changes.

Key insights

Naively using VLM-generated captions can degrade accuracy due to anchoring effects and asymmetric error types.

Principles

Captions exert a "deep anchoring effect" on VLM reasoning and lexical choices.
Caption errors are "structurally asymmetric": omissions are common but mild, fabrications are rare but highly impactful.
A caption's usefulness is a per-query property, not a per-corpus one.

Method

GEASS performs dual-path inference, combining clean and caption-augmented logits. It uses a confidence gate, an information-gain weight, and a disagreement penalty to adaptively regulate caption influence at the logit level.
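
The summary does not give the exact formulas, so the following is only a minimal sketch of how gated logit-level fusion of this kind could look, not the authors' implementation. Everything specific in it is an assumption made for illustration: the function names (`caption_trust`, `fuse_logits`), the threshold `tau`, the use of entropy reduction as the information-gain weight, the use of Jensen-Shannon divergence as the disagreement penalty, and the linear interpolation between the two logit vectors.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(p):
    """Shannon entropy in nats."""
    return -sum(q * math.log(q) for q in p if q > 0.0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in nats (bounded above by ln 2)."""
    mix = [(a + b) / 2.0 for a, b in zip(p, q)]

    def kl(a, b):
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0.0)

    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)


def caption_trust(clean_logits, caption_logits, tau=0.9):
    """Return a trust weight in [0, 1] for the caption-augmented path.

    Hypothetical design (not taken from the paper):
      - confidence gate: ignore the caption when the clean path is
        already confident (max probability >= tau);
      - information-gain weight: normalized entropy reduction that the
        caption brings relative to the clean path;
      - disagreement penalty: shrink trust as the two distributions
        diverge, measured by Jensen-Shannon divergence.
    Expects at least two vocabulary entries.
    """
    p_clean = softmax(clean_logits)
    p_cap = softmax(caption_logits)

    gate = 0.0 if max(p_clean) >= tau else 1.0

    max_entropy = math.log(len(clean_logits))
    gain = max(0.0, entropy(p_clean) - entropy(p_cap)) / max_entropy

    agreement = max(0.0, 1.0 - js_divergence(p_clean, p_cap) / math.log(2.0))

    return gate * gain * agreement


def fuse_logits(clean_logits, caption_logits, tau=0.9):
    """Interpolate from clean logits toward caption logits by the trust weight."""
    alpha = caption_trust(clean_logits, caption_logits, tau)
    return [c + alpha * (a - c) for c, a in zip(clean_logits, caption_logits)]


if __name__ == "__main__":
    # Clean path is confident: the gate closes and the caption is ignored.
    print(fuse_logits([10.0, 0.0], [0.0, 10.0]))
    # Clean path is uncertain and the caption sharpens the same answer.
    print(fuse_logits([0.2, 0.0], [3.0, 0.0]))
```

In this sketch the trust weight is computed per query, which mirrors the principle that a caption's usefulness is a per-query property: a caption that contradicts a confident clean prediction contributes nothing, while one that reduces uncertainty without strong disagreement is partially blended in.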

In practice

GEASS is plug-and-play and requires no architectural modifications or retraining.
It adds only two extra forward passes per query and is compatible with any VLM that exposes decoding logits.

Topics

Vision-Language Models
Object Hallucination
Caption Steering
Inference-time Mitigation
Logit Fusion
Qwen2.5-VL-3B
HallusionBench

Best for: Research Scientist, Computer Vision Engineer, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.