Imaginative Perception Tokens Enhance Spatial Reasoning in Multimodal Language Models
Summary
Imaginative Perception Tokens (IPTs) are intermediate perceptual representations designed to improve spatial reasoning in Vision Language Models (VLMs) when critical information cannot be observed directly. VLMs typically struggle with tasks that require "imaginative perception," such as inferring unseen viewpoints or tracing paths through occluded spaces. IPTs externalize what a VLM would perceive under alternative spatial configurations that are consistent with the observed input, without generating images at inference time. To evaluate the approach, three tasks were formulated: Perspective Taking (PET), Path Tracing (PT), and Multiview Counting (MVC), accompanied by approximately 20K examples with ground-truth imaginations. With the BAGEL VLM as the backbone, IPT supervision consistently improved spatial reasoning and outperformed textual chain of thought. Specifically, IPT boosted MVC accuracy by 3.4% and achieved competitive performance on PT against strong closed-source models. Combining IPT with label-only supervision yielded further gains, and the results suggest that textual chain of thought can degrade performance due to modality mismatch.
Key takeaway
If you are a Machine Learning Engineer developing VLMs for complex spatial tasks, consider integrating Imaginative Perception Tokens (IPTs). In the reported experiments, IPT supervision improved spatial reasoning when information is occluded or unobserved and outperformed textual chain of thought. The approach is also reported to improve generalization and to provide interpretable intermediate representations. Explore combining IPTs with label-only supervision to maximize gains, and evaluate carefully whether textual chain of thought helps or hurts spatial computation in your setting.
Key insights
Imaginative Perception Tokens (IPT) improve VLM spatial reasoning by externalizing unobserved perceptual information.
Principles
- Spatial reasoning benefits from explicit imaginative perception.
- Intermediate perceptual representations enhance VLM generalization.
Method
Introduce Imaginative Perception Tokens (IPTs) as intermediate perceptual representations. Supervise the IPTs with ground-truth imaginations for tasks such as PET, PT, and MVC. Integrate the IPTs into a VLM backbone (e.g., BAGEL) for training.
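The summary does not spell out the training objective, so the following is only a minimal sketch of one way to combine IPT supervision with label supervision. It assumes, hypothetically, that IPTs are continuous vectors compared to a ground-truth imagination with mean squared error, and that the final answer is a class index scored with cross-entropy. The function names and the `ipt_weight` parameter are illustrative and do not come from the original work or from BAGEL.

```python
import math
from typing import Sequence


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Negative log-likelihood of the target class under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


def mean_squared_error(pred: Sequence[float], truth: Sequence[float]) -> float:
    """Average squared difference between two equal-length vectors."""
    if len(pred) != len(truth):
        raise ValueError("pred and truth must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(pred)


def ipt_training_loss(
    answer_logits: Sequence[float],
    answer_index: int,
    predicted_ipt: Sequence[float],
    target_ipt: Sequence[float],
    ipt_weight: float = 1.0,  # hypothetical trade-off weight, not from the source
) -> float:
    """Label loss plus a weighted IPT loss (an assumed form, not the paper's)."""
    label_loss = cross_entropy(answer_logits, answer_index)
    ipt_loss = mean_squared_error(predicted_ipt, target_ipt)
    return label_loss + ipt_weight * ipt_loss


# Toy inputs for illustration only.
loss = ipt_training_loss(
    answer_logits=[2.0, 0.5, -1.0],
    answer_index=0,
    predicted_ipt=[0.1, 0.9],
    target_ipt=[0.0, 1.0],
    ipt_weight=0.5,
)
print(f"combined loss: {loss:.4f}")
```

Setting `ipt_weight` to zero recovers label-only supervision, which makes it straightforward to compare the two regimes under the same code path.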
In practice
- Use IPTs to improve VLM performance on occluded spatial tasks.
- Combine IPT with label-only supervision for further gains.
- Test whether textual chain of thought helps before relying on it for complex spatial reasoning; the results suggest it can degrade performance.
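One way to act on the points above is to compare supervision variants per task before committing to one. The sketch below is a hypothetical evaluation helper: the variant names (`label_only`, `text_cot`, `ipt`) and the record layout are assumptions made for illustration, and the sample records are placeholders, not results from the study.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Each record is (task, supervision_variant, is_correct).
Record = Tuple[str, str, bool]


def accuracy_by_task_and_variant(
    records: Iterable[Record],
) -> Dict[Tuple[str, str], float]:
    """Compute accuracy for every (task, variant) pair seen in the records."""
    correct: Dict[Tuple[str, str], int] = defaultdict(int)
    total: Dict[Tuple[str, str], int] = defaultdict(int)
    for task, variant, is_correct in records:
        key = (task, variant)
        total[key] += 1
        correct[key] += int(is_correct)
    return {key: correct[key] / total[key] for key in total}


def gain_over_baseline(
    acc: Dict[Tuple[str, str], float],
    task: str,
    variant: str,
    baseline: str = "label_only",
) -> float:
    """Accuracy difference between a variant and the baseline on one task."""
    return acc[(task, variant)] - acc[(task, baseline)]


# Placeholder records for illustration only; replace with real model outputs.
records = [
    ("MVC", "label_only", True),
    ("MVC", "label_only", False),
    ("MVC", "text_cot", False),
    ("MVC", "text_cot", False),
    ("MVC", "ipt", True),
    ("MVC", "ipt", True),
]

acc = accuracy_by_task_and_variant(records)
for variant in ("text_cot", "ipt"):
    print(variant, gain_over_baseline(acc, "MVC", variant))
```

Reporting gains per task rather than as a single average keeps differences between PET, PT, and MVC visible.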
Topics
- Imaginative Perception Tokens
- Vision Language Models
- Spatial Reasoning
- Multimodal AI
- Perceptual Representations
- BAGEL VLM
Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.