Chain-of-Thought Degrades Visual Spatial Reasoning Capabilities of Multimodal LLMs
Summary
An evaluation of seventeen Multimodal Reasoning Models (MRMs) and Multimodal Language Models (MLMs) across thirteen visual spatial reasoning benchmarks finds that Chain-of-Thought (CoT) prompting consistently degrades performance on spatial tasks. In contrast to its success in mathematical and logical domains, CoT prompting lowered accuracy by an average of 3% across diverse MLMs, and six of the eight evaluated open-source MRMs performed better without CoT. The study, which covers models such as GThinker, Vision-R1, ViGoRL, and Qwen3-VL, also found that MRMs often underperform their own backbones on these benchmarks. A novel "No-Image++" ablation further showed that CoT-prompted models suffer from severe shortcut learning: they hallucinate visual details from textual priors even when given a blank image. The authors take this as evidence of a critical need for vision-centric reasoning paradigms.
Key takeaway
If you are an AI engineer building multimodal LLMs for visual spatial reasoning, reconsider applying Chain-of-Thought prompting by default. Your models may perform better with direct prompting on spatial tasks, because CoT can induce hallucination and shortcut learning. Focus on vision-centric training paradigms, and consider integrating visual verifiers so that reasoning is grounded in actual image evidence rather than in text-based reasoning chains alone.
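One way to act on this takeaway is to measure direct and CoT prompting side by side on your own spatial benchmark before choosing a default. The following is a minimal sketch, not the study's evaluation harness: the prompt templates, the `model(image, prompt)` callable, the item fields (`image`, `question`, `options`, `answer`), and the answer-extraction rule are all assumptions made for illustration.

```python
import re
from typing import Any, Callable, Dict, List

# Hypothetical prompt templates; the study's exact wording is not given here.
DIRECT_TEMPLATE = (
    "{question}\nOptions: {options}\nAnswer with the option letter only."
)
COT_TEMPLATE = (
    "{question}\nOptions: {options}\n"
    "Think step by step, then give the final option letter on the last line."
)


def format_options(options: List[str]) -> str:
    """Render options as '(A) ... (B) ...'."""
    return " ".join(f"({chr(ord('A') + i)}) {opt}" for i, opt in enumerate(options))


def extract_letter(response: str) -> str:
    """Return the last standalone letter A-H on the last non-empty line, or ''."""
    lines = [ln for ln in response.strip().splitlines() if ln.strip()]
    if not lines:
        return ""
    matches = re.findall(r"\b([A-H])\b", lines[-1])
    return matches[-1] if matches else ""


def accuracy(
    model: Callable[[Any, str], str],
    items: List[Dict[str, Any]],
    template: str,
) -> float:
    """Fraction of items whose extracted letter matches the gold letter."""
    correct = 0
    for item in items:
        prompt = template.format(
            question=item["question"], options=format_options(item["options"])
        )
        if extract_letter(model(item["image"], prompt)) == item["answer"]:
            correct += 1
    return correct / len(items)


def compare_prompting(
    model: Callable[[Any, str], str], items: List[Dict[str, Any]]
) -> Dict[str, float]:
    """Score the same items under both prompt styles; delta < 0 means CoT hurt."""
    direct = accuracy(model, items, DIRECT_TEMPLATE)
    cot = accuracy(model, items, COT_TEMPLATE)
    return {"direct": direct, "cot": cot, "delta": cot - direct}
```

Keeping the items, option order, and scoring rule identical across both conditions means any difference in `delta` comes from the prompt style rather than from the evaluation setup.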
Key insights
Chain-of-Thought prompting degrades visual spatial reasoning in multimodal LLMs due to shortcut learning and hallucination.
Principles
- Text-centric CoT is insufficient for robust spatial intelligence.
- Concise reasoning traces may mitigate performance degradation.
- MRMs often underperform their backbones on spatial tasks.
Method
Seventeen models were benchmarked across thirteen spatial datasets using uniform evaluation and scoring. A "No-Image++" ablation tested hallucination by replacing images with a blank gray screen and adding a "Cannot determine" option.
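The ablation described above can be sketched in a few lines. This is an illustrative reconstruction under stated assumptions, not the authors' code: the image size, the gray level, the PPM encoding, the item fields, and the `abstention_rate` metric are all hypothetical choices made here to keep the example dependency-free.

```python
from typing import Any, Dict, List

CANNOT_DETERMINE = "Cannot determine"


def blank_gray_ppm(width: int = 224, height: int = 224, gray: int = 128) -> bytes:
    """Encode a uniform gray image as binary PPM (P6).

    Size and gray level are assumptions; the summary only says "blank gray screen".
    """
    header = f"P6\n{width} {height}\n255\n".encode("ascii")
    return header + bytes([gray]) * (3 * width * height)


def to_no_image_pp(item: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the item with a blank image and an extra abstain option."""
    options = list(item["options"])  # copy so the original item is untouched
    if CANNOT_DETERMINE not in options:
        options.append(CANNOT_DETERMINE)
    return {**item, "image": blank_gray_ppm(), "options": options}


def abstention_rate(chosen_options: List[str]) -> float:
    """Fraction of responses that picked the abstain option.

    Hypothetical metric: on a blank image, a low rate suggests the model is
    answering from textual priors rather than from visual evidence.
    """
    abstained = sum(choice == CANNOT_DETERMINE for choice in chosen_options)
    return abstained / len(chosen_options)
```

Running the same model on the original and the ablated items, then comparing how often it still commits to a concrete spatial answer, gives a rough signal of how much its output depends on the image at all.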
In practice
- Prioritize vision-centric training for spatial reasoning tasks.
- Implement test-time visual verifiers for reasoning steps.
- Develop visual process reward models for grounded reasoning.
Topics
- Multimodal Reasoning Models
- Chain-of-Thought Prompting
- Visual Spatial Reasoning
- Shortcut Learning
- Hallucinations
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.