See Less, Specify More: Visual Evidence Budgets for Generalizable VLAs

Source: Artificial Intelligence · Field: Technology & Digital — Robotics & Autonomous Systems, Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

The S2 (See Less, Specify More) framework targets generalization bottlenecks in vision-language-action (VLA) models, which often struggle with distractors, appearance shifts, and the need to infer local execution details from coarse instructions. S2 improves generalization by training the executor against a cleaner interface. "Specify More" refines trajectories into subtask-level language, disambiguating execution modes while preserving the original high-level goal. "See Less" imposes an explicit visual evidence budget, so the executor must act on task-sufficient visual information rather than unconstrained context, without requiring region or mask annotations. The executor can then follow precise guidance with less reliance on distracting visual patches and less ambiguity about how to execute. S2 is compatible with existing VLM planners through in-context learning. Across eight real-robot tasks on TX-G2 and HSR, S2 raised mean subtask success from 54.2% to 79.0% relative to pi0.5.
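To put the reported numbers in perspective, the short calculation below derives the absolute and relative gain from the two success rates quoted above. Only the 54.2% and 79.0% figures come from the summary; the variable names are illustrative.

```python
# Mean subtask success rates as quoted in the summary (percent).
baseline_success = 54.2  # pi0.5
s2_success = 79.0        # S2

# Absolute gain in percentage points.
absolute_gain = round(s2_success - baseline_success, 1)

# Relative gain over the baseline, in percent.
relative_gain = round((s2_success - baseline_success) / baseline_success * 100, 1)

print(f"Absolute gain: {absolute_gain} percentage points")
print(f"Relative gain: {relative_gain}%")
```

This works out to a gain of 24.8 percentage points, or roughly 45.8% over the baseline.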

Key takeaway

Robotics engineers developing vision-language-action (VLA) models should prioritize training executors with explicit, subtask-level guidance and constrained visual evidence. This approach, exemplified by S2, improves generalization by reducing reliance on coarse instructions and broad visual context. Relabeling trajectories with subtask-level language and budgeting visual evidence raised mean subtask success from 54.2% to 79.0% across eight real-robot tasks on TX-G2 and HSR.

Key insights

VLA generalization improves when executors are trained with explicit local guidance and task-sufficient visual evidence rather than coarse instructions and unconstrained visual context.

Method

The S2 framework refines trajectories into subtask-level language ("Specify More") and imposes an explicit visual evidence budget during executor training ("See Less") to improve VLA generalization.
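The two ideas can be illustrated with a minimal sketch. The summary does not say how S2 implements either component, so everything below is an assumption made for illustration: the `RelabeledStep` record, the `apply_evidence_budget` function, the per-patch relevance scores, and the choice of top-k selection as the budgeting rule are all hypothetical stand-ins, not the authors' method.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RelabeledStep:
    """Hypothetical training record for the "Specify More" idea."""
    goal: str                   # original high-level instruction, kept unchanged
    subtask: str                # refined subtask-level instruction
    patches: List[List[float]]  # one feature vector per visual patch


def apply_evidence_budget(
    patches: List[List[float]],
    scores: List[float],
    budget: int,
) -> Tuple[List[List[float]], List[bool]]:
    """Keep the `budget` highest-scoring patches and zero out the rest.

    Assumption: `scores` are per-patch relevance values produced elsewhere
    (for example by a learned scorer); top-k selection is a stand-in rule.
    """
    if len(patches) != len(scores):
        raise ValueError("patches and scores must have the same length")
    if not 0 < budget <= len(scores):
        raise ValueError("budget must be between 1 and the number of patches")

    # Rank patch indices from most to least relevant and keep the top `budget`.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:budget])
    mask = [i in keep for i in range(len(scores))]

    # Patches outside the budget are replaced by zero vectors.
    masked = [list(row) if m else [0.0] * len(row) for row, m in zip(patches, mask)]
    return masked, mask


step = RelabeledStep(
    goal="put the cup on the shelf",
    subtask="grasp the cup by its handle",
    patches=[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]],
)
masked_patches, keep_mask = apply_evidence_budget(
    step.patches, scores=[0.1, 0.9, 0.3, 0.7], budget=2
)
print(keep_mask)
print(masked_patches)
```

In this toy example the executor would see only the two highest-scoring patches, while the subtask string carries the execution detail that the high-level goal leaves open.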

Best for: Research Scientist, AI Scientist, Robotics Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.