Imaginative Perception Tokens Enhance Spatial Reasoning in Multimodal Language Models

2026-06-02 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

Imaginative Perception Tokens (IPTs) are intermediate perceptual representations designed to improve spatial reasoning in Vision Language Models (VLMs) when critical information is unobservable. VLMs typically struggle with tasks that require "imaginative perception," such as inferring unseen viewpoints or tracing paths through occluded spaces. IPTs externalize what a VLM would perceive under alternative spatial configurations that are consistent with the observed input, without generating images at inference time. To evaluate the approach, the authors formulated three tasks, Perspective Taking (PET), Path Tracing (PT), and Multiview Counting (MVC), accompanied by approximately 20K examples with ground truth imaginations. With the BAGEL VLM as the backbone, IPT supervision consistently improved spatial reasoning and outperformed textual chain of thought. Specifically, IPT raised MVC accuracy by 3.4% and achieved competitive performance on PT against strong closed-source models. Combining IPT with label-only supervision yielded further gains, which suggests that textual chain of thought can degrade performance through modality mismatch.

Key takeaway

For machine learning engineers developing VLMs for complex spatial tasks, Imaginative Perception Tokens (IPTs) are worth integrating. In the reported experiments, IPT supervision improved spatial reasoning, especially when information is occluded, and outperformed textual chain of thought. The approach is also reported to improve generalization and to provide interpretable intermediate representations. Consider combining IPTs with label-only supervision to maximize gains, and evaluate carefully how textual chain of thought affects spatial computation in your own setup.

Key insights

Imaginative Perception Tokens (IPT) improve VLM spatial reasoning by externalizing unobserved perceptual information.

Principles

Spatial reasoning benefits from explicit imaginative perception.
Intermediate perceptual representations enhance VLM generalization.

Method

Introduce Imaginative Perception Tokens (IPTs) as intermediate perceptual representations. Supervise IPTs with ground truth imaginations for tasks such as PET, PT, and MVC. Integrate IPTs into a VLM backbone (e.g., BAGEL) for training.
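
The supervision described above can be pictured as a two-term training objective: one term for the final answer label and one for the intermediate IPTs. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation. The summary does not say how IPTs are represented or which loss is used, so the vector form of each token, the mean-squared-error term, the `ipt_weight` parameter, and all function names are hypothetical.

```python
import math
from typing import Sequence


def cross_entropy(logits: Sequence[float], target_index: int) -> float:
    """Negative log-likelihood of the target class under a softmax over logits."""
    peak = max(logits)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target_index]


def mean_squared_error(
    predicted: Sequence[Sequence[float]],
    target: Sequence[Sequence[float]],
) -> float:
    """Mean squared error over every element of every token vector."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target must hold the same number of tokens")
    total, count = 0.0, 0
    for p_vec, t_vec in zip(predicted, target):
        if len(p_vec) != len(t_vec):
            raise ValueError("token vectors must have matching dimensions")
        for p, t in zip(p_vec, t_vec):
            total += (p - t) ** 2
            count += 1
    return total / count


def ipt_training_loss(
    answer_logits: Sequence[float],
    answer_index: int,
    predicted_ipt: Sequence[Sequence[float]],
    target_ipt: Sequence[Sequence[float]],
    ipt_weight: float = 1.0,  # hypothetical weighting, not taken from the source
) -> float:
    """Answer-label loss plus a weighted IPT supervision term (assumed form)."""
    label_loss = cross_entropy(answer_logits, answer_index)
    ipt_loss = mean_squared_error(predicted_ipt, target_ipt)
    return label_loss + ipt_weight * ipt_loss
```

In this sketch, setting `ipt_weight` to 0 reduces the objective to label-only supervision, which makes it easy to compare the two regimes on the same data. Here `target_ipt` stands in for an encoding of the ground truth imagination that accompanies each example.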

In practice

Use IPTs to improve VLM performance on occluded spatial tasks.
Combine IPT with label-only supervision for further gains.
Treat textual chain of thought with caution for complex spatial reasoning, since it can degrade performance.
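
One way to act on the points above is to score each supervision variant on the same held-out questions before committing to one. The sketch below is a minimal comparison harness under stated assumptions: the function names, the variant names, and the answer format are hypothetical, and nothing here reproduces the paper's evaluation protocol or results.

```python
from typing import Dict, List, Sequence, Tuple


def accuracy(predictions: Sequence[str], labels: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the reference labels."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and equal in length")
    correct = sum(1 for pred, gold in zip(predictions, labels) if pred == gold)
    return correct / len(labels)


def rank_variants(
    predictions_by_variant: Dict[str, Sequence[str]],
    labels: Sequence[str],
) -> List[Tuple[str, float]]:
    """Score every variant on the same labels and sort from best to worst."""
    scored = [
        (name, accuracy(preds, labels))
        for name, preds in predictions_by_variant.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

Calling `rank_variants` with one prediction list per variant (for example label-only, textual chain of thought, IPT, and IPT plus label-only) returns the variants ordered by accuracy, so any degradation from textual chain of thought shows up directly on your own task.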

Topics

Imaginative Perception Tokens
Vision Language Models
Spatial Reasoning
Multimodal AI
Perceptual Representations
BAGEL VLM

Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.