Learning Visual Spatial Planning from Symbolic State via Modality-Gap-Aware Self-Distillation
Summary
The MGSD framework addresses visual spatial planning in vision-language models by tackling the perception-reasoning modality gap. The gap arises because visual planning requires first inferring latent state structure from pixels and then reasoning over it, whereas symbolic planning operates directly on explicit objects. MGSD applies a two-stage, modality-gap-aware self-distillation process. First, a cold-start grounding stage establishes reliable visual state representations. Second, a privileged teacher transfers planning capability through on-policy distillation, using explicit symbolic states to supervise the student's visual rollouts. Symbolic data is used only during training, not at inference. Experiments show that MGSD improves macro-average visual planning performance by 19.3% on a 4B backbone and by 18.4% on an 8B backbone, narrowing the gap to the symbolic-input upper bounds.
Key takeaway
For machine learning engineers building visual planning systems, MGSD offers a way to narrow the perception-reasoning gap: modality-gap-aware self-distillation targets both visual state recovery and multi-step planning. Because symbolic data is needed only during training, the deployed model runs on visual input alone.
Key insights
MGSD bridges the perception-reasoning modality gap in visual planning through a two-stage self-distillation framework.
Principles
- Symbolic data can supervise visual planning without being used at inference.
- Addressing perception and reasoning bottlenecks separately improves visual planning.
Method
MGSD uses a cold-start grounding stage for visual state representation, followed by on-policy distillation from a privileged teacher using explicit symbolic states to transfer planning capabilities.
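The on-policy distillation step can be illustrated with a minimal sketch. The summary does not state the loss MGSD uses, so the per-step reverse KL divergence below is an assumption, and the names `student_logits_fn`, `teacher_logits_fn`, `toy_student`, and `toy_teacher` are hypothetical stand-ins for the image-conditioned student and the symbolic-state-conditioned teacher. What the sketch does show is the on-policy part: the rollout is sampled from the student, and the teacher only scores the steps the student actually took.

```python
import math
import random


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(student_probs, teacher_probs, eps=1e-12):
    """KL(student || teacher) at one step (assumed loss, not confirmed by the source)."""
    return sum(
        s * (math.log(s + eps) - math.log(t + eps))
        for s, t in zip(student_probs, teacher_probs)
    )


def on_policy_distill_loss(student_logits_fn, teacher_logits_fn, horizon, rng):
    """Sample a rollout from the student and score each step against the teacher.

    student_logits_fn(prefix): hypothetical callable, logits given the image.
    teacher_logits_fn(prefix): hypothetical callable, logits given the symbolic state.
    Returns the mean per-step divergence and the sampled action sequence.
    """
    prefix, per_step = [], []
    for _ in range(horizon):
        s = softmax(student_logits_fn(prefix))
        t = softmax(teacher_logits_fn(prefix))
        per_step.append(reverse_kl(s, t))
        # On-policy: the next action comes from the student, not the teacher.
        action = rng.choices(range(len(s)), weights=s, k=1)[0]
        prefix.append(action)
    return sum(per_step) / horizon, prefix


# Toy policies over four moves: the student is undecided, the teacher prefers move 0.
def toy_student(prefix):
    return [0.0, 0.0, 0.0, 0.0]


def toy_teacher(prefix):
    return [3.0, 0.0, 0.0, 0.0]


loss, rollout = on_policy_distill_loss(
    toy_student, toy_teacher, horizon=5, rng=random.Random(0)
)
```

In a real training loop the two callables would be forward passes of the same backbone under different inputs, and the loss would be backpropagated through the student only.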
In practice
- Employ self-distillation to bridge modality gaps in visual tasks.
- Use symbolic data during training to enhance visual planning capabilities.
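One way to keep symbolic data confined to training is to make the separation explicit in the data schema. The sketch below is illustrative only: `PlanningSample`, `student_input`, and `teacher_input` are hypothetical names, and feeding the teacher the symbolic state alone (without the image) is an assumption, since the summary does not specify the teacher's full input.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PlanningSample:
    """One planning problem. The symbolic state is privileged, training-only data."""
    image_path: str
    symbolic_state: Optional[str] = None  # None at inference time


def student_input(sample: PlanningSample) -> dict:
    """The student sees only the image, in training and at inference alike."""
    return {"image": sample.image_path}


def teacher_input(sample: PlanningSample) -> dict:
    """The teacher sees the explicit symbolic state (assumed: state only)."""
    if sample.symbolic_state is None:
        raise ValueError("teacher requires a symbolic state, which exists only in training data")
    return {"state": sample.symbolic_state}


train_sample = PlanningSample("maze_001.png", symbolic_state="agent=(0,0); goal=(3,3)")
deploy_sample = PlanningSample("maze_999.png")
```

Because `student_input` never reads `symbolic_state`, the same function serves both training and deployment, and a teacher call on a deployment sample fails loudly instead of silently degrading.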
Topics
- Visual Spatial Planning
- Vision-Language Models
- Self-Distillation
- Modality Gap
- Symbolic Planning
- State Representation
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.