Learning Visual Spatial Planning from Symbolic State via Modality-Gap-Aware Self-Distillation

Source: cs.AI updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems) · Depth: Expert, long

Summary

Vision-language models (VLMs) struggle with visual spatial planning because of a "perception–reasoning modality gap": they must first infer the latent state from pixels and then reason over that state to choose actions. Researchers from Tsinghua University and The Hong Kong University of Science and Technology propose MGSD, a two-stage modality-gap-aware self-distillation framework. A cold-start grounding stage first equips the visual student with reliable state representations. A privileged teacher, which receives the explicit symbolic state, then transfers planning capability to the student through on-policy distillation. Symbolic data is used only during training, so inference remains purely visual. On visual planning benchmarks, MGSD consistently improves visual planning with both 4B and 8B backbones, raising the macro average by 19.3% and 18.4%, respectively, and narrowing the gap to the symbolic-input upper bounds.

Key takeaway

For AI engineers building visual planning systems, this research suggests a practical way to narrow the perception–reasoning modality gap: a two-stage self-distillation approach. Start with perception-oriented supervised fine-tuning (SFT) so the model reliably recovers state from images, then apply symbolic-guided on-policy distillation. The model benefits from explicit symbolic planning knowledge during training without needing it at inference, improving task success and optimal-path reasoning in the reported experiments.

Key insights

Symbolic state data, used only during training, helps bridge the perception–reasoning modality gap and improves VLM visual planning performance.

Method

MGSD has two stages. First, perception-oriented supervised fine-tuning (SFT) grounds the visual student in state recovery. Second, symbolic-guided on-policy self-distillation transfers planning behavior to the student from a frozen text-only teacher that reads the symbolic state.
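The second stage can be illustrated with a minimal sketch. It is not the authors' implementation: the summary does not state which divergence MGSD uses, so the per-token reverse KL below is an assumption borrowed from common on-policy distillation practice, and every name (`on_policy_distillation_loss`, `toy_student`, `toy_teacher`, the four-action vocabulary) is hypothetical. The point it shows is the data flow: the student samples its own rollout from the image, and the frozen teacher scores that same rollout while reading the symbolic state.

```python
import math
import random


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(student_probs, teacher_probs, eps=1e-12):
    """KL(student || teacher) at one token position (assumed loss form)."""
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(student_probs, teacher_probs)
    )


def on_policy_distillation_loss(student_logits_fn, teacher_logits_fn,
                                image, symbolic_state, max_steps, rng):
    """Roll out the student and score each step against the teacher.

    The student conditions on the image; the teacher conditions on the
    symbolic state. Both see the same student-sampled prefix, which is
    what makes the distillation on-policy.
    """
    prefix, per_token = [], []
    for _ in range(max_steps):
        s = softmax(student_logits_fn(image, prefix))
        t = softmax(teacher_logits_fn(symbolic_state, prefix))
        per_token.append(reverse_kl(s, t))
        # Sample the next token from the student, not the teacher.
        token = rng.choices(range(len(s)), weights=s, k=1)[0]
        prefix.append(token)
    return sum(per_token) / len(per_token), prefix


# Hypothetical stand-ins for the two models over a 4-action vocabulary.
ACTIONS = ["up", "down", "left", "right"]


def toy_student(image, prefix):
    return [0.5, 0.2, 0.2, 0.1]  # weakly prefers "up"


def toy_teacher(symbolic_state, prefix):
    return [2.0, 0.0, 0.0, 0.0]  # strongly prefers "up"


loss, rollout = on_policy_distillation_loss(
    toy_student, toy_teacher,
    image=None, symbolic_state="toy symbolic state",
    max_steps=5, rng=random.Random(0),
)
print("mean per-token loss:", loss)
print("student rollout:", [ACTIONS[i] for i in rollout])
```

In a real training loop the loss would be backpropagated through the student only, with the teacher's parameters frozen. At inference the teacher and the symbolic state are dropped, and the student plans from the image alone.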

Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.