Learning Visual Spatial Planning from Symbolic State via Modality-Gap-Aware Self-Distillation
Summary
MGSD (Modality-Gap-Aware Self-Distillation) targets a weakness of vision-language models in visual spatial planning: a perception-reasoning modality gap that hinders both visual state recovery and multi-step planning. The method has two stages. A cold-start grounding stage first establishes reliable visual state representations, reducing early perception noise. A privileged teacher, given explicit symbolic states, then transfers planning ability through on-policy distillation by supervising the student model's visual rollout prefixes. Symbolic data is used only during training, so inference remains purely visual. On visual planning benchmarks, MGSD consistently improves both 4B and 8B backbones, raising the macro average by 19.3% and 18.4%, respectively, and narrowing the gap to symbolic-input upper bounds.
Key takeaway
If you are developing visual spatial planning models, consider a two-stage self-distillation approach like MGSD. It narrows the perception-reasoning modality gap by using symbolic data during training to improve visual state recovery and multi-step planning, without requiring symbolic input at inference. The reported macro-average gains are 19.3% on the 4B backbone and 18.4% on the 8B backbone.
Key insights
Visual spatial planning benefits from self-distillation that bridges the perception-reasoning modality gap.
Principles
- Visual planning has dual bottlenecks: state recovery and multi-step reasoning.
- Symbolic data can supervise visual planning without being used at inference.
- Reducing the modality gap improves both perception and optimal-path reasoning.
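The first principle can be illustrated with a toy example. The sketch below assumes a small grid maze, which this summary does not specify as the benchmark; the grid layout, the `shortest_path_length` helper, and the misread cell are all hypothetical. It shows how a single state-recovery error changes the plan even when the reasoning step (here, breadth-first search) is exact.

```python
from collections import deque

def shortest_path_length(grid, start, goal):
    """Breadth-first search over a symbolic grid; '#' marks a wall."""
    rows, cols = len(grid), len(grid[0])
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        (r, c), dist = queue.popleft()
        if (r, c) == goal:
            return dist
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] != "#" and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append(((nr, nc), dist + 1))
    return None  # goal unreachable

# Hypothetical ground-truth symbolic state.
true_state = [".#.",
              ".#.",
              "..."]

# Hypothetical perception error: the wall at row 0, column 1 is read as free.
misread_state = ["...",
                 ".#.",
                 "..."]

start, goal = (0, 0), (0, 2)
true_len = shortest_path_length(true_state, start, goal)
misread_len = shortest_path_length(misread_state, start, goal)

# The plan built on the misread state is shorter, but it walks through a wall
# in the true environment, so correct reasoning cannot rescue wrong perception.
print(true_len, misread_len)
```

The planner is identical in both calls; only the recovered state differs. That is the sense in which state recovery and multi-step reasoning are separate bottlenecks.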
Method
MGSD uses a cold-start grounding stage for visual state representation, followed by on-policy distillation from a privileged teacher using explicit symbolic states to supervise visual rollouts.
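The distillation stage can be sketched as a per-step divergence between two views of the same rollout prefix. This is a minimal illustration, not the paper's implementation: the summary does not state the loss, so the reverse KL direction (student relative to teacher), the four-action vocabulary, and all probability values below are assumptions, and `kl_divergence` and `distillation_loss` are hypothetical helper names.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists of probabilities.

    Assumes every q_i is positive wherever p_i is positive.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_step_probs, teacher_step_probs):
    """Mean per-step KL(student || teacher) over one student-sampled prefix."""
    per_step = [
        kl_divergence(s, t)
        for s, t in zip(student_step_probs, teacher_step_probs)
    ]
    return sum(per_step) / len(per_step)

# Hypothetical next-action distributions over (up, down, left, right) at two
# steps of a prefix that the student sampled from the image alone.
student_probs = [
    [0.40, 0.30, 0.20, 0.10],
    [0.25, 0.25, 0.25, 0.25],
]

# The privileged teacher scores the same prefix but also sees the explicit
# symbolic state, so its distributions are assumed to be sharper.
teacher_probs = [
    [0.70, 0.10, 0.10, 0.10],
    [0.10, 0.70, 0.10, 0.10],
]

loss = distillation_loss(student_probs, teacher_probs)
print(f"distillation loss: {loss:.4f}")
```

Because the prefix is sampled by the student, the teacher's signal lands on states the student actually visits. The symbolic state enters only through the teacher's distributions, so it is not needed at inference.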
In practice
- Apply two-stage self-distillation to improve visual planning.
- Leverage symbolic data for training-only supervision.
- Focus on enhancing both visual state recovery and optimal-path reasoning.
Topics
- Visual Spatial Planning
- Self-Distillation
- Modality Gap
- Vision-Language Models
- Symbolic Planning
- Machine Learning
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.