Learning Visual Spatial Planning from Symbolic State via Modality-Gap-Aware Self-Distillation

2026-06-04 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

MGSD is a framework for visual spatial planning in vision-language models that targets the perception-reasoning modality gap. The gap arises because visual planning requires a model to infer latent state structure from pixels and then reason over it, whereas symbolic planning operates on explicitly given objects. MGSD uses a two-stage, modality-gap-aware self-distillation process. First, a cold-start grounding stage establishes reliable visual state representations. Second, a privileged teacher that sees explicit symbolic states transfers planning capability through on-policy distillation, supervising the student's visual rollouts. Symbolic data is used only during training, never at inference. In experiments, MGSD improves visual planning on 4B and 8B backbones by 19.3% and 18.4% macro average, respectively, narrowing the gap to symbolic-input upper bounds.

Key takeaway

For machine learning engineers building visual planning systems that struggle with the perception-reasoning gap, MGSD offers a framework that improves both visual state recovery and multi-step planning. Because symbolic data is needed only during training and not at inference, the deployed model still takes visual input alone.

Key insights

MGSD bridges the perception-reasoning modality gap in visual planning through a two-stage self-distillation framework.

Principles

Symbolic data can supervise visual planning without being used at inference.
Addressing perception and reasoning bottlenecks separately improves visual planning.

Method

MGSD uses a cold-start grounding stage for visual state representation, followed by on-policy distillation from a privileged teacher using explicit symbolic states to transfer planning capabilities.
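
The on-policy distillation step can be illustrated with a toy sketch. It is not the MGSD implementation: the summary does not state the loss, so per-step reverse KL (a common choice in on-policy distillation) is assumed here, and the policies, action set, and function names are all hypothetical. The point it shows is that the rollout is sampled from the student, and the privileged teacher only scores the states the student actually visits.

```python
import math
import random


def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(student_probs, teacher_probs, eps=1e-12):
    """KL(student || teacher) for one decoding step (assumed loss, not from the source)."""
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(student_probs, teacher_probs)
    )


def distill_on_student_rollout(student_policy, teacher_policy, horizon, rng):
    """Sample a rollout from the student and score every step with the teacher.

    student_policy(prefix) -> logits, conditioned on the visual input.
    teacher_policy(prefix) -> logits, conditioned on the privileged symbolic state.
    Both are hypothetical callables standing in for the same backbone.
    """
    prefix, step_losses = [], []
    for _ in range(horizon):
        s_probs = softmax(student_policy(prefix))
        t_probs = softmax(teacher_policy(prefix))
        step_losses.append(reverse_kl(s_probs, t_probs))
        # On-policy: the next action comes from the student, not the teacher.
        action = rng.choices(range(len(s_probs)), weights=s_probs)[0]
        prefix.append(action)
    return prefix, sum(step_losses) / len(step_losses)


# Toy policies over four moves (up, down, left, right); values are made up.
def uncertain_student(prefix):
    return [0.0, 0.0, 0.0, 0.0]


def symbolic_teacher(prefix):
    return [0.0, 0.0, 0.0, 2.0]


rollout, loss = distill_on_student_rollout(
    uncertain_student, symbolic_teacher, horizon=3, rng=random.Random(0)
)
print(rollout, loss)
```

In a real setup the loss would be backpropagated through the student only, with the teacher's outputs treated as fixed targets.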

In practice

Employ self-distillation to bridge modality gaps in visual tasks.
Use symbolic data during training to enhance visual planning capabilities.
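
One way to keep symbolic data strictly training-only is to build the student and teacher inputs from the same example but through separate code paths. The sketch below assumes a hypothetical record layout and prompt format; neither comes from the source, and whether the MGSD teacher also sees the image is not stated, so the teacher here receives only the symbolic state and the goal.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PlanningExample:
    """Hypothetical training or inference record."""
    image_path: str
    goal: str
    symbolic_state: Optional[str] = None  # present only in training data


def student_prompt(example: PlanningExample) -> str:
    """Visual-only input: the same function serves training and inference."""
    return f"<image:{example.image_path}>\nGoal: {example.goal}\nPlan:"


def teacher_prompt(example: PlanningExample) -> str:
    """Privileged input: requires the symbolic state, so it fails at inference."""
    if example.symbolic_state is None:
        raise ValueError("teacher input needs a symbolic state (training only)")
    return f"State: {example.symbolic_state}\nGoal: {example.goal}\nPlan:"


train_example = PlanningExample(
    image_path="maze_001.png",
    goal="reach the exit",
    symbolic_state="agent=(0,0); exit=(2,2); walls=[(1,1)]",
)
print(student_prompt(train_example))
print(teacher_prompt(train_example))
```

Because the student path never reads `symbolic_state`, the deployed model cannot silently depend on it.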

Topics

Visual Spatial Planning
Vision-Language Models
Self-Distillation
Modality Gap
Symbolic Planning
State Representation

Code references

Oranger-l/MGSD

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.