Learning Visual Spatial Planning from Symbolic State via Modality-Gap-Aware Self-Distillation

2026-06-04 · Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, medium

Summary

MGSD (modality-gap-aware self-distillation) targets a weakness of vision-language models in visual spatial planning: a perception-reasoning modality gap that hinders both visual state recovery and multi-step planning. The method has two stages. A cold-start grounding stage first establishes reliable visual state representations, which reduces early perception noise. A privileged teacher then transfers planning ability through on-policy distillation, using explicit symbolic states to supervise the student model's visual rollout prefixes. Symbolic data is used only during training, so inference remains purely visual. On visual planning benchmarks, MGSD improves performance on both 4B and 8B backbones, with macro-average increases of 19.3% and 18.4%, respectively, and narrows the gap to symbolic-input upper bounds.

Key takeaway

AI scientists building visual spatial planning systems should consider a two-stage self-distillation approach such as MGSD. It narrows the perception-reasoning modality gap by using symbolic data during training to improve visual state recovery and multi-step planning, with no symbolic input required at inference. On the reported visual planning benchmarks, it raised macro-average performance by 19.3% on the 4B backbone and 18.4% on the 8B backbone.

Key insights

Visual spatial planning benefits from self-distillation that bridges the perception-reasoning modality gap.

Principles

Visual planning has dual bottlenecks: state recovery and multi-step reasoning.
Symbolic data can supervise visual planning without being used at inference.
Reducing the modality gap improves both perception and optimal-path reasoning.
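
The first two principles can be made concrete with a small sketch. It is not taken from the paper: the `Example` record, the string-per-row grid encoding, and the helpers `cell_accuracy` and `strip_privileged` are all hypothetical names chosen for illustration. The idea is that a training example carries a symbolic state that can score state recovery on its own, separately from planning, while the inference path drops that field entirely.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical symbolic encoding: one string per grid row,
# e.g. "S" = start, "G" = goal, "#" = wall, "." = free cell.
Grid = List[str]


@dataclass
class Example:
    image_id: str                           # stand-in for the rendered image
    symbolic_state: Optional[Grid] = None   # privileged, training-only field


def cell_accuracy(predicted: Grid, target: Grid) -> float:
    """Fraction of grid cells the model recovered correctly (state recovery only)."""
    total = sum(len(row) for row in target)
    correct = sum(
        p == t
        for p_row, t_row in zip(predicted, target)
        for p, t in zip(p_row, t_row)
    )
    return correct / total


def strip_privileged(example: Example) -> Example:
    """Build the inference-time input: the image reference without symbolic state."""
    return Example(image_id=example.image_id)


# Usage with made-up data: the prediction gets one of six cells wrong.
train_example = Example("maze_001.png", ["S.#", "..G"])
predicted_grid = ["S.#", ".#G"]
recovery_score = cell_accuracy(predicted_grid, train_example.symbolic_state)
inference_example = strip_privileged(train_example)
```

Scoring state recovery separately from path quality is one way to tell which of the two bottlenecks a given model is stuck on.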

Method

MGSD uses a cold-start grounding stage for visual state representation, followed by on-policy distillation from a privileged teacher using explicit symbolic states to supervise visual rollouts.
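
The distillation stage can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the paper's implementation: the policies are hand-written stand-ins for model calls, the per-step KL(student || teacher) objective and its direction are assumptions, and names such as `teacher_policy`, `student_policy`, and `on_policy_distillation_loss` are hypothetical. What it does reflect from the description above is that the prefix is sampled from the student (on-policy), the teacher alone reads the symbolic state, and the student reads only the image.

```python
import math
import random

ACTIONS = ["up", "down", "left", "right"]
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same actions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def teacher_policy(symbolic_state, prefix):
    """Toy privileged teacher: reads exact coordinates and prefers moves toward the goal."""
    r, c = symbolic_state["agent"]
    for action in prefix:                      # replay the prefix to find the current cell
        dr, dc = MOVES[action]
        r, c = r + dr, c + dc
    gr, gc = symbolic_state["goal"]
    logits = []
    for action in ACTIONS:
        dr, dc = MOVES[action]
        logits.append(-float(abs(r + dr - gr) + abs(c + dc - gc)))
    return softmax(logits)


def student_policy(image, prefix):
    """Toy student: sees only the image; a uniform placeholder for a real model."""
    return [1.0 / len(ACTIONS)] * len(ACTIONS)


def on_policy_distillation_loss(student, teacher, image, symbolic_state, prefix_len, rng):
    """Average per-step KL(student || teacher) along a prefix sampled from the student."""
    prefix, total = [], 0.0
    for _ in range(prefix_len):
        s = student(image, prefix)             # visual input only
        t = teacher(symbolic_state, prefix)    # privileged symbolic input
        total += kl_divergence(s, t)
        prefix.append(rng.choices(ACTIONS, weights=s)[0])  # on-policy: student samples
    return total / prefix_len, prefix


state = {"agent": (0, 0), "goal": (2, 3)}
loss, prefix = on_policy_distillation_loss(
    student_policy, teacher_policy, image=None,
    symbolic_state=state, prefix_len=4, rng=random.Random(0),
)
```

In a real setup both policies would be model forward passes, and in a self-distillation setting they would plausibly share weights and differ only in their input. The loss would be minimized with respect to the student's parameters while the teacher's outputs are treated as fixed targets.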

In practice

Apply two-stage self-distillation to improve visual planning.
Leverage symbolic data for training-only supervision.
Focus on enhancing both visual state recovery and optimal-path reasoning.

Topics

Visual Spatial Planning
Self-Distillation
Modality Gap
Vision-Language Models
Symbolic Planning
Machine Learning

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.