Learning Visual Spatial Planning from Symbolic State via Modality-Gap-Aware Self-Distillation

· Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

The MGSD framework targets visual spatial planning in vision-language models by addressing the perception-reasoning modality gap. The gap arises because visual planning requires first inferring latent state structure from pixels and then reasoning over it, whereas symbolic planning operates directly on explicit objects. MGSD uses a two-stage, modality-gap-aware self-distillation process. First, a cold-start grounding stage establishes reliable visual state representations. Second, a privileged teacher with access to explicit symbolic states transfers planning capability through on-policy distillation, supervising the student's visual rollouts. Symbolic data is used only during training. In experiments, MGSD improves visual planning on 4B and 8B backbones by 19.3% and 18.4% in macro average, respectively, narrowing the gap to the symbolic-input upper bounds.

Key takeaway

For machine learning engineers building visual planning systems: if the perception-reasoning gap is limiting performance, MGSD offers a framework for narrowing it. Modality-gap-aware self-distillation targets both visual state recovery and multi-step planning. Because symbolic data is needed only during training and not at inference, the deployed model takes visual input alone.

Key insights

MGSD bridges the perception-reasoning modality gap in visual planning through a two-stage self-distillation framework.

Principles

Method

MGSD uses a cold-start grounding stage for visual state representation, followed by on-policy distillation from a privileged teacher using explicit symbolic states to transfer planning capabilities.
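The on-policy distillation step can be illustrated with a minimal sketch. It assumes the student (conditioned on the image) samples a rollout, and the privileged teacher (conditioned on the symbolic state) scores the same tokens. The summary does not state which divergence MGSD uses; the per-token reverse KL below is a common choice in on-policy distillation and is an assumption here, as are all function names and the toy logits.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one token position."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def reverse_kl(student_logits: Sequence[float],
               teacher_logits: Sequence[float]) -> float:
    """KL(student || teacher) at a single token position."""
    s = log_softmax(student_logits)
    t = log_softmax(teacher_logits)
    return sum(math.exp(si) * (si - ti) for si, ti in zip(s, t))


def on_policy_distill_loss(student_steps: Sequence[Sequence[float]],
                           teacher_steps: Sequence[Sequence[float]]) -> float:
    """Mean per-token reverse KL over a rollout sampled from the student.

    student_steps: logits of the student (visual input) at each rollout step.
    teacher_steps: logits of the teacher (symbolic input) at the same steps.
    Both are hypothetical stand-ins for real model outputs.
    """
    if len(student_steps) != len(teacher_steps):
        raise ValueError("student and teacher must score the same rollout")
    if not student_steps:
        raise ValueError("rollout is empty")
    total = sum(reverse_kl(s, t) for s, t in zip(student_steps, teacher_steps))
    return total / len(student_steps)


if __name__ == "__main__":
    # Toy action vocabulary of four moves (assumed): up, down, left, right.
    student = [[0.0, 0.0, 0.0, 0.0], [2.0, 0.0, 0.0, 0.0]]
    teacher = [[3.0, 0.0, 0.0, 0.0], [2.0, 0.0, 0.0, 0.0]]
    print(on_policy_distill_loss(student, teacher))
```

The loss is zero at steps where the student's distribution already matches the teacher's, so gradient signal concentrates on steps where visual perception leads the student away from what the symbolic-state teacher would do.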

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.