See Less, Specify More: Visual Evidence Budgets for Generalizable VLAs

2026-06-01 · Source: Artificial Intelligence · Field: Technology & Digital — Robotics & Autonomous Systems, Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

The S2 (See Less, Specify More) framework targets generalization bottlenecks in vision-language-action (VLA) models, which often struggle with distractors, appearance shifts, and the need to infer local execution details from coarse instructions. S2 improves generalization by training the executor against a cleaner interface. "Specify More" relabels trajectories with subtask-level language, disambiguating execution modes while keeping the original high-level goal. "See Less" imposes an explicit visual evidence budget, so the executor must act on task-sufficient visual information rather than unconstrained context, and it does so without region or mask annotations. The executor therefore follows precise guidance and leans less on distracting visual patches. S2 works with existing VLM planners through in-context learning. Across eight real-robot tasks on TX-G2 and HSR, S2 raised mean subtask success from 54.2% to 79.0% relative to pi0.5, a gain of 24.8 percentage points.

Key takeaway

Robotics engineers developing vision-language-action (VLA) models should prioritize training executors with explicit, subtask-level guidance and constrained visual evidence. This approach, exemplified by S2, improves generalization by reducing reliance on ambiguous weak supervision and broad visual context. Relabel trajectories with subtask-level language and apply a visual evidence budget during executor training: in the reported experiments, this raised mean subtask success from 54.2% to 79.0% across eight real-robot tasks on TX-G2 and HSR.

Key insights

VLA generalization improves when executors are trained with explicit local guidance and task-sufficient visual evidence rather than coarse, weakly supervised instructions.

Principles

Coarse instructions induce supervision aliasing.
Local guidance outperforms instruction replacement.
Evidence budgeting reduces broad visual context dependence.
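
The first two principles can be illustrated with a toy check. This is a minimal sketch, not the S2 relabeling pipeline: the segment data, the mode names, and the aliased_instructions helper are all hypothetical. It flags any instruction string that labels more than one execution mode (supervision aliasing), then shows that adding subtask-level guidance while keeping the goal removes the collision.

```python
from collections import defaultdict


def aliased_instructions(segments):
    """Return instructions that label more than one distinct execution mode.

    `segments` is a list of (instruction, execution_mode) pairs. Both fields
    are hypothetical stand-ins for a language label and the behavior the
    corresponding trajectory segment demonstrates.
    """
    modes = defaultdict(set)
    for instruction, mode in segments:
        modes[instruction].add(mode)
    return {ins: sorted(m) for ins, m in modes.items() if len(m) > 1}


# Coarse labeling: one instruction covers three different behaviors.
coarse = [
    ("tidy the table", "grasp_cup"),
    ("tidy the table", "open_drawer"),
    ("tidy the table", "place_cup_in_drawer"),
]

# Subtask-level relabeling: the high-level goal is kept and local guidance
# is appended, rather than replacing the instruction outright.
refined = [
    ("tidy the table: grasp the cup", "grasp_cup"),
    ("tidy the table: open the drawer", "open_drawer"),
    ("tidy the table: place the cup in the drawer", "place_cup_in_drawer"),
]

coarse_collisions = aliased_instructions(coarse)
refined_collisions = aliased_instructions(refined)
```

Here coarse_collisions maps "tidy the table" to three modes, while refined_collisions is empty, which is the property the relabeling step aims for.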

Method

The S2 framework refines trajectories into subtask-level language ("Specify More") and imposes an explicit visual evidence budget during executor training ("See Less") to improve VLA generalization.
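
One way to picture an evidence budget is a hard cap on how many image patches the executor may use. The sketch below is an assumption for illustration only: the summary does not say how S2 enforces its budget, and the per-patch relevance scores, the top-k rule, and the apply_evidence_budget name are all hypothetical. It keeps the highest-scoring patches and masks the rest, with no region or mask annotations involved.

```python
def apply_evidence_budget(patch_scores, budget):
    """Return a 0/1 mask that keeps the `budget` highest-scoring patches.

    `patch_scores` is a hypothetical list of per-patch relevance scores,
    for example produced by the executor itself during training. Ties are
    broken in favor of the lower patch index because sorted() is stable.
    """
    if not 0 < budget <= len(patch_scores):
        raise ValueError("budget must be between 1 and the number of patches")
    ranked = sorted(
        range(len(patch_scores)),
        key=lambda i: patch_scores[i],
        reverse=True,
    )
    keep = set(ranked[:budget])
    return [1 if i in keep else 0 for i in range(len(patch_scores))]


# Five patches, budget of two: only patches 1 and 3 survive.
scores = [0.05, 0.40, 0.10, 0.30, 0.15]
mask = apply_evidence_budget(scores, budget=2)
print(mask)  # [0, 1, 0, 1, 0]
```

A hard top-k mask is the simplest possible reading of "budget". A soft penalty on the total attention mass would be another, and the summary does not indicate which, if either, matches the paper.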

In practice

Refine VLA trajectories into subtask-level language.
Apply explicit visual evidence budgets.
Integrate with VLM planners via in-context learning.
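
The third step, pairing the executor with a VLM planner through in-context learning, can be sketched as few-shot prompt construction. Everything here is a hypothetical illustration: the prompt wording, the field names, and build_planner_prompt are assumptions, and no specific planner API is implied. The planner sees a handful of goal-to-subtask examples and is asked to emit the next subtask-level instruction for the executor.

```python
def build_planner_prompt(examples, goal, observation):
    """Assemble a few-shot prompt asking a planner for the next subtask.

    `examples` is a list of dicts with hypothetical keys "goal",
    "observation", and "subtask". The returned string ends with an open
    "Subtask:" slot for the planner to complete.
    """
    lines = [
        "Given the goal and the current observation, "
        "state the next subtask for the robot executor."
    ]
    for ex in examples:
        lines += [
            "",
            f"Goal: {ex['goal']}",
            f"Observation: {ex['observation']}",
            f"Subtask: {ex['subtask']}",
        ]
    lines += ["", f"Goal: {goal}", f"Observation: {observation}", "Subtask:"]
    return "\n".join(lines)


# Hypothetical in-context examples written in the executor's subtask style.
examples = [
    {
        "goal": "tidy the table",
        "observation": "a cup is on the table, the drawer is closed",
        "subtask": "open the drawer",
    },
    {
        "goal": "tidy the table",
        "observation": "a cup is on the table, the drawer is open",
        "subtask": "grasp the cup",
    },
]

prompt = build_planner_prompt(
    examples,
    goal="tidy the table",
    observation="the cup is in the gripper, the drawer is open",
)
```

The resulting string would be sent to whichever VLM serves as planner, and its completion passed to the executor alongside the original goal.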

Topics

Vision-Language-Action Models
Robotics Generalization
Visual Evidence Budgeting
Executor Training
In-Context Learning
AgiBot G2

Best for: Research Scientist, AI Scientist, Robotics Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.