Chain-of-Thought Degrades Visual Spatial Reasoning Capabilities of Multimodal LLMs

Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

An evaluation of seventeen Multimodal Reasoning Models (MRMs) and Multimodal Language Models (MLMs) across thirteen visual spatial reasoning benchmarks finds that Chain-of-Thought (CoT) prompting consistently degrades performance on spatial tasks. In contrast to its success in mathematical and logical domains, CoT prompting lowered accuracy by an average of 3% across diverse MLMs, and six of the eight evaluated open-source MRMs performed better without CoT. The study, which covers models such as GThinker, Vision-R1, ViGoRL, and Qwen3-VL, also found that MRMs often underperform their own backbones on these benchmarks. A novel "No-Image++" ablation further showed that CoT-prompted models suffer from severe shortcut learning: they hallucinate visual details from textual priors even when presented with a blank image. The authors take this as evidence of a critical need for vision-centric reasoning paradigms.

Key takeaway

If you are an AI engineer developing multimodal LLMs for visual spatial reasoning, reconsider applying Chain-of-Thought prompting by default. Your models may perform better with direct prompting on spatial tasks, because CoT can induce hallucination and shortcut learning. Focus on vision-centric training paradigms, and consider integrating visual verifiers so that reasoning is grounded in actual image evidence rather than in text-based reasoning chains alone.
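
One way to check this on your own data is to score the same items with and without a CoT instruction. The sketch below is a minimal illustration, not the paper's evaluation harness: the prompt suffixes, the item layout (image, question, options, gold letter), and the model callable are all assumptions made for the example.

```python
import re
from typing import Callable, List, Sequence, Tuple

# Hypothetical prompt suffixes; the paper's exact wording is not given in this summary.
DIRECT_SUFFIX = "Answer with the option letter only."
COT_SUFFIX = "Think step by step, then give the final answer as 'Answer: <letter>'."

# Hypothetical item layout: (image bytes, question, options, gold option letter).
Item = Tuple[bytes, str, Sequence[str], str]
# Hypothetical model interface: takes an image and a prompt, returns text.
Model = Callable[[bytes, str], str]


def build_prompt(question: str, options: Sequence[str], use_cot: bool) -> str:
    """Format a multiple-choice question with either a direct or a CoT instruction."""
    letters = "ABCDEFGH"
    lines = [question]
    lines += [f"{letters[i]}. {opt}" for i, opt in enumerate(options)]
    lines.append(COT_SUFFIX if use_cot else DIRECT_SUFFIX)
    return "\n".join(lines)


def extract_letter(response: str) -> str:
    """Prefer an explicit 'Answer: X' marker, else fall back to the first lone letter."""
    match = re.search(r"Answer:\s*([A-H])", response)
    if match is None:
        match = re.search(r"\b([A-H])\b", response)
    return match.group(1) if match else ""


def accuracy(model: Model, items: List[Item], use_cot: bool) -> float:
    """Fraction of items whose extracted letter equals the gold letter."""
    if not items:
        return 0.0
    correct = 0
    for image, question, options, gold in items:
        reply = model(image, build_prompt(question, options, use_cot))
        correct += extract_letter(reply) == gold
    return correct / len(items)


def cot_delta(model: Model, items: List[Item]) -> float:
    """Accuracy with CoT minus accuracy with direct prompting (negative means CoT hurts)."""
    return accuracy(model, items, use_cot=True) - accuracy(model, items, use_cot=False)
```

A negative cot_delta on your spatial items would be consistent with the degradation described above. The fallback in extract_letter is deliberately crude and can misread free-form reasoning text, so a stricter answer format may be preferable in practice.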

Key insights

Chain-of-Thought prompting degrades visual spatial reasoning in multimodal LLMs due to shortcut learning and hallucination.

Method

Seventeen models were benchmarked across thirteen spatial datasets using uniform evaluation and scoring. A "No-Image++" ablation tested hallucination by replacing images with a blank gray screen and adding a "Cannot determine" option.
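
The No-Image++ construction can be sketched roughly as follows. This is an illustration under stated assumptions rather than the authors' code: the SpatialItem schema, the image size, the gray level, and the choice to treat "Cannot determine" as the correct answer for a blank image are all hypothetical. The blank image is written as a binary PPM so the example needs only the standard library.

```python
from dataclasses import dataclass, replace
from typing import List

ABSTAIN_OPTION = "Cannot determine"  # option wording as given in the summary


@dataclass(frozen=True)
class SpatialItem:
    """One multiple-choice benchmark item (hypothetical schema)."""
    question: str
    options: List[str]
    answer_index: int
    image: bytes


def blank_gray_ppm(width: int = 224, height: int = 224, level: int = 128) -> bytes:
    """Return a uniform gray image encoded as binary PPM (P6); size and level are assumptions."""
    header = f"P6\n{width} {height}\n255\n".encode("ascii")
    return header + bytes([level]) * (3 * width * height)


def to_no_image_pp(item: SpatialItem) -> SpatialItem:
    """Swap the image for a blank gray one and append the abstain option.

    Assumption: with no visual evidence, the abstain option counts as correct.
    """
    options = list(item.options)  # copy so the original item is left untouched
    if ABSTAIN_OPTION not in options:
        options.append(ABSTAIN_OPTION)
    return replace(
        item,
        options=options,
        answer_index=options.index(ABSTAIN_OPTION),
        image=blank_gray_ppm(),
    )


def hallucination_rate(predictions: List[int], items: List[SpatialItem]) -> float:
    """Fraction of blank-image items where the model chose anything but the abstain option."""
    if not items:
        return 0.0
    wrong = sum(1 for pred, item in zip(predictions, items) if pred != item.answer_index)
    return wrong / len(items)
```

Under these assumptions, a model that still picks a concrete spatial answer for a blank image is answering from textual priors rather than from the picture, which is the failure mode the ablation is meant to expose.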

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.