Beyond Shortcuts: Mitigating Visual Illusions in Frozen VLMs via Qualitative Reasoning

· Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision & Image Processing · Depth: Expert, long

Summary

Structured Qualitative Inference (SQI) is a training-free framework that improves the perceptual robustness of frozen Vision-Language Models (VLMs) against optical illusions. VLMs often fail on illusions because they rely on shortcut heuristics, such as favoring linguistic priors over visual evidence. This produces three failure modes: metric hallucination, background interference, and confirmation bias. SQI addresses them with three modules: Axiomatic Constraint Injection, which suppresses erroneous quantitative estimates; Hierarchical Scene Decomposition, which isolates target visual elements from distractors; and Counterfactual Self-Verification, which mitigates confirmation bias. Evaluated on the DataCV 2026 Challenge (Task I: Classic Illusion Understanding), SQI placed 2nd overall, with reported accuracy gains across diverse illusion categories and improved diagnostic interpretability, all without any model fine-tuning.

Key takeaway

For research scientists building robust vision-language systems, SQI offers a training-free way to reduce VLM vulnerability to optical illusions. Consider integrating structured qualitative inference to improve perceptual grounding and diagnostic interpretability, especially with frozen VLM backbones where fine-tuning is not feasible. The method improves reliability by addressing reasoning-level failures rather than only visual representations.

Key insights

Structured qualitative reasoning at inference time improves VLM robustness to visual illusions without fine-tuning.

Method

SQI applies Axiomatic Constraint Injection to suppress metric hallucinations, Hierarchical Scene Decomposition to isolate targets, and Counterfactual Self-Verification to mitigate confirmation bias, all at inference time.
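The three modules can be pictured as a two-pass prompting loop around a frozen model. The sketch below is illustrative only and is not the authors' implementation: the prompt wording, the `vlm` callable (any function taking image bytes and a prompt and returning text), the yes/no answer format, and all function names are assumptions made for this example.

```python
from typing import Callable, Dict

# Hypothetical constraint text; the paper's actual axioms are not given in this summary.
AXIOMS = (
    "Do not estimate lengths, sizes, angles, or brightness as numbers. "
    "Use only qualitative relations such as longer, shorter, equal, parallel."
)


def decomposition_prompt(question: str) -> str:
    """Pass 1: inject constraints and ask for a target/context breakdown."""
    return (
        f"{AXIOMS}\n"
        "Step 1: list the target elements the question asks about.\n"
        "Step 2: list surrounding context elements that could distort perception.\n"
        "Step 3: compare only the target elements, ignoring the context.\n"
        f"Question: {question}\n"
        "Finish with a line 'ANSWER: <yes|no>'."
    )


def counterfactual_prompt(question: str, draft: str) -> str:
    """Pass 2: ask the model to test the contrary of its own draft answer."""
    opposite = "no" if draft == "yes" else "yes"
    return (
        f"{AXIOMS}\n"
        f"Question: {question}\n"
        f"Your draft answer was '{draft}'. Assume the opposite answer "
        f"('{opposite}') is correct and look for visual evidence supporting it.\n"
        "Keep the draft only if that evidence is absent.\n"
        "Finish with a line 'ANSWER: <yes|no>'."
    )


def parse_answer(text: str) -> str:
    """Return the last 'ANSWER:' value in the reply, or 'unknown' if none is found."""
    for line in reversed(text.strip().splitlines()):
        if line.strip().upper().startswith("ANSWER:"):
            return line.split(":", 1)[1].strip().lower()
    return "unknown"


def sqi_answer(vlm: Callable[[bytes, str], str], image: bytes, question: str) -> Dict:
    """Run both passes and keep the intermediate replies as a diagnostic trace."""
    first = vlm(image, decomposition_prompt(question))
    draft = parse_answer(first)
    second = vlm(image, counterfactual_prompt(question, draft))
    final = parse_answer(second)
    return {"draft": draft, "final": final, "revised": draft != final,
            "trace": [first, second]}
```

Because the model weights are never touched, everything happens in the prompts, and the returned `trace` keeps both intermediate replies so a reader can see whether a wrong answer came from the decomposition step or survived verification.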

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.