Learning Visual Spatial Planning from Symbolic State via Modality-Gap-Aware Self-Distillation

Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems) · Depth: Expert, medium

Summary

MGSD (modality-gap-aware self-distillation) targets a weakness of vision-language models in visual spatial planning: a perception-reasoning modality gap that hinders both visual state recovery and multi-step planning. The method has two stages. First, a cold-start grounding stage establishes reliable visual state representations, reducing early perception noise. Second, a privileged teacher transfers planning capability through on-policy distillation, using explicit symbolic states to supervise the student model's visual rollout prefixes. Symbolic data is used only during training, so inference remains purely visual. On visual planning benchmarks, MGSD improves performance on both 4B and 8B backbones, with macro-average gains of 19.3% and 18.4%, respectively, and narrows the gap to symbolic-input upper bounds.

Key takeaway

AI scientists building visual spatial planning systems should consider a two-stage self-distillation approach like MGSD. It addresses the perception-reasoning modality gap by using symbolic data during training to improve visual state recovery and multi-step planning, without requiring symbolic input at inference. The reported macro-average gains on visual planning benchmarks are 19.3% on the 4B backbone and 18.4% on the 8B backbone.

Key insights

Visual spatial planning benefits from self-distillation that bridges the perception-reasoning modality gap.

Method

MGSD uses a cold-start grounding stage for visual state representation, followed by on-policy distillation from a privileged teacher using explicit symbolic states to supervise visual rollouts.
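The second stage can be illustrated with a toy sketch. The source does not give the loss, the model interfaces, or the task format, so everything here is an assumption made for illustration: the grid world, the hypothetical `teacher_probs` and `student_probs` functions, and the use of a per-step reverse KL divergence (a common choice for on-policy distillation, not confirmed as MGSD's). The cold-start grounding stage is not modeled. The point the sketch tries to capture is that the prefix is sampled from the student, while the teacher scores that same prefix using the symbolic state.

```python
import math
import random

# Toy action space for a grid world (an assumption, not from the source).
ACTIONS = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}
NAMES = list(ACTIONS)


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def apply_prefix(pos, prefix):
    """Return the position reached after executing the actions in prefix."""
    x, y = pos
    for a in prefix:
        dx, dy = ACTIONS[a]
        x, y = x + dx, y + dy
    return (x, y)


def teacher_probs(symbolic_state, prefix):
    """Hypothetical privileged teacher: reads exact coordinates (training only)."""
    pos = apply_prefix(symbolic_state["agent"], prefix)
    gx, gy = symbolic_state["goal"]
    logits = []
    for a in NAMES:
        dx, dy = ACTIONS[a]
        dist = abs(pos[0] + dx - gx) + abs(pos[1] + dy - gy)
        logits.append(-float(dist))  # prefer moves that end closer to the goal
    return softmax(logits)


def student_probs(visual_logits, prefix):
    """Hypothetical student: image-derived logits only; this toy ignores prefix."""
    return softmax(visual_logits)


def reverse_kl(p_student, p_teacher):
    """KL(student || teacher) for one step."""
    return sum(ps * math.log(ps / pt)
               for ps, pt in zip(p_student, p_teacher) if ps > 0)


def on_policy_distill_loss(visual_logits, symbolic_state, horizon, rng):
    """Average per-step divergence along a prefix sampled from the student."""
    prefix, total = [], 0.0
    for _ in range(horizon):
        ps = student_probs(visual_logits, prefix)
        pt = teacher_probs(symbolic_state, prefix)  # teacher scores the student's prefix
        total += reverse_kl(ps, pt)
        # On-policy: the next action is drawn from the student, not the teacher.
        prefix.append(rng.choices(NAMES, weights=ps, k=1)[0])
    return total / horizon


state = {"agent": (0, 0), "goal": (3, 0)}
print(on_policy_distill_loss([0.0, 0.0, 0.0, 2.0], state, 3, random.Random(0)))
```

In this sketch the symbolic state appears only inside `teacher_probs`, which mirrors the training-only use of symbolic data described above. At inference only `student_probs` would be called.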

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.