Learning Visual Spatial Planning from Symbolic State via Modality-Gap-Aware Self-Distillation
Summary
Vision-language models (VLMs) struggle with visual spatial planning because of a "perception–reasoning modality gap": they must first infer the latent state from pixels and then reason about which actions to take. Researchers from Tsinghua University and The Hong Kong University of Science and Technology propose MGSD, a two-stage modality-gap-aware self-distillation framework. The first stage is a cold-start grounding stage that equips the visual student with reliable state representations. In the second stage, a privileged teacher that sees explicit symbolic states transfers planning capability to the student via on-policy distillation. Symbolic data is used only during training, so inference remains purely visual. Experiments on visual planning benchmarks show that MGSD consistently improves visual planning across 4B and 8B backbones, raising the macro average by 19.3% and 18.4%, respectively, and narrowing the gap to symbolic-input upper bounds.
Key takeaway
For AI engineers developing visual planning systems, this research suggests a strategy for overcoming the perception–reasoning modality gap. Consider a two-stage self-distillation approach: start with perception-oriented supervised fine-tuning to establish robust visual state recovery, then apply symbolic-guided on-policy distillation. This lets your models use explicit symbolic planning knowledge during training without requiring it at inference, which the authors report improves task success and optimal-path reasoning.
Key insights
Bridging the perception–reasoning modality gap in visual planning improves VLM performance by leveraging symbolic data during training.
Principles
- Visual planning has a dual perception–reasoning bottleneck.
- Symbolic states can provide privileged training supervision.
- On-policy distillation corrects errors along the planning chain.
Method
MGSD employs two stages: perception-oriented SFT for state grounding, then symbolic-guided on-policy self-distillation to transfer planning behavior using a frozen text-only teacher.
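To make the first stage more concrete, here is a minimal sketch of how perception QA pairs for cold-start SFT could be derived from a symbolic state. Everything in it is an assumption for illustration: the grid encoding (`A` for agent, `G` for goal, `#` for wall), the function names, and the question wording are hypothetical and are not taken from the paper. In a real pipeline each QA list would be paired with the rendered image of the same state, so that the student learns to answer from pixels.

```python
from typing import Dict, List, Tuple

Grid = List[str]  # hypothetical symbolic state: one string per grid row


def find_cells(grid: Grid, symbol: str) -> List[Tuple[int, int]]:
    """Return (row, column) positions of every cell holding `symbol`."""
    return [
        (r, c)
        for r, row in enumerate(grid)
        for c, ch in enumerate(row)
        if ch == symbol
    ]


def perception_qa(grid: Grid) -> List[Dict[str, str]]:
    """Build state-recovery questions whose answers come from the symbolic state."""
    agent = find_cells(grid, "A")[0]
    goal = find_cells(grid, "G")[0]
    walls = find_cells(grid, "#")
    return [
        {
            "question": "How many rows and columns does the grid have?",
            "answer": f"{len(grid)} rows, {len(grid[0])} columns",
        },
        {"question": "Where is the agent (row, column)?", "answer": str(agent)},
        {"question": "Where is the goal (row, column)?", "answer": str(goal)},
        {"question": "How many wall cells are there?", "answer": str(len(walls))},
    ]


# Toy 3x3 state, assumed for illustration only.
GRID: Grid = ["A.#", "..#", "..G"]
qa_pairs = perception_qa(GRID)
for pair in qa_pairs:
    print(pair["question"], "->", pair["answer"])
```

The point of the design is that the answers are computed from the symbolic state, so the supervision is exact, while the model being fine-tuned would only ever see the image and the question.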
In practice
- Use structured perception QA for cold-start SFT.
- Employ symbolic states as privileged teacher context.
- Apply reverse-KL-style loss on student rollouts.
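The last point can be illustrated with a small, framework-free sketch of a reverse-KL loss evaluated on a student rollout. It assumes that, at each token position of a sequence sampled from the student, you have the student's logits (conditioned on the image) and the teacher's logits (conditioned on the symbolic state and the same sampled prefix). The function names and the toy logits are hypothetical, and the paper's exact loss may differ; the summary only describes it as "reverse-KL-style". Gradient handling, such as keeping the teacher frozen, is omitted here.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary distribution."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def reverse_kl_per_token(
    student_logits: Sequence[float], teacher_logits: Sequence[float]
) -> float:
    """KL(student || teacher): the expectation is taken under the student."""
    s = log_softmax(student_logits)
    t = log_softmax(teacher_logits)
    return sum(math.exp(ls) * (ls - lt) for ls, lt in zip(s, t))


def rollout_reverse_kl(
    student_steps: Sequence[Sequence[float]],
    teacher_steps: Sequence[Sequence[float]],
) -> float:
    """Mean per-token reverse KL over the positions of one student rollout."""
    if len(student_steps) != len(teacher_steps) or not student_steps:
        raise ValueError("need the same non-zero number of positions for both models")
    per_token = [
        reverse_kl_per_token(s, t) for s, t in zip(student_steps, teacher_steps)
    ]
    return sum(per_token) / len(per_token)


# Toy rollout: 2 positions over a 4-action vocabulary (up, down, left, right).
student = [[2.0, 0.5, 0.1, 0.1], [0.3, 0.3, 1.5, 0.2]]
teacher = [[3.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.2, 2.5]]
loss = rollout_reverse_kl(student, teacher)
print(f"mean reverse KL over rollout: {loss:.4f}")
```

Because the expectation is taken under the student's own distribution on its own samples, the penalty lands on what the student actually does during planning. This fits the principle above that on-policy distillation corrects errors along the planning chain.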
Topics
- Visual Spatial Planning
- Modality Gap
- Self-Distillation
- On-Policy Distillation
- Vision-Language Models
- Symbolic State Supervision
Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.