See Less, Specify More: Visual Evidence Budgets for Generalizable VLAs
Summary
The S2 (See Less, Specify More) framework targets generalization bottlenecks in vision-language-action (VLA) models, which often struggle with distractors, appearance shifts, and the need to infer local execution details from coarse instructions. S2 improves generalization by training the executor against a cleaner interface. "Specify More" relabels trajectories with subtask-level language, which disambiguates execution modes while keeping the original high-level goal. "See Less" imposes an explicit visual evidence budget, so the executor must act on task-sufficient visual information rather than unconstrained context, and it does so without region or mask annotations. Together, these let the executor follow precise guidance while relying less on distracting visual patches. S2 is compatible with existing VLM planners through in-context learning. Across eight real-robot tasks on TX-G2 and HSR, S2 raised mean subtask success from 54.2% to 79.0% relative to pi0.5, a gain of 24.8 percentage points.
Key takeaway
Robotics engineers developing vision-language-action (VLA) models should prioritize training executors with explicit, subtask-level guidance and constrained visual evidence. This approach, exemplified by S2, improves generalization by reducing reliance on ambiguous weak supervision and broad visual context. Use refined trajectory relabeling and visual evidence budgeting to improve real-robot task success: S2 reached 79.0% mean subtask success on TX-G2 and HSR, up from 54.2% for pi0.5.
Key insights
VLA generalization improves when executors are trained with explicit local guidance and task-sufficient visual evidence, rather than with weak supervision from coarse instructions.
Principles
- Coarse instructions induce supervision aliasing.
- Local guidance outperforms instruction replacement.
- Evidence budgeting reduces broad visual context dependence.
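The first two principles can be illustrated with a toy example. The sketch below is not from the original article: the demonstrations, the `aliased_labels` helper, and the `goal | subtask: ...` label format are all hypothetical. It only shows what aliasing means here: one coarse instruction supervising several distinct action modes, and how keeping the goal while appending a subtask removes the ambiguity.

```python
from collections import defaultdict

# Hypothetical toy demonstrations: (high-level goal, subtask, action mode).
DEMOS = [
    ("tidy the table", "pick up the cup", "grasp"),
    ("tidy the table", "wipe the surface", "wipe"),
    ("tidy the table", "place the cup on the shelf", "place"),
    ("open the drawer", "pull the drawer handle", "pull"),
]


def aliased_labels(pairs):
    """Return every label that supervises more than one distinct action mode."""
    modes = defaultdict(set)
    for label, mode in pairs:
        modes[label].add(mode)
    return {label: sorted(m) for label, m in modes.items() if len(m) > 1}


def goal_plus_subtask(goal, subtask):
    """Assumed label format: keep the goal and append local guidance."""
    return f"{goal} | subtask: {subtask}"


# Coarse supervision: only the high-level goal labels each segment.
coarse = aliased_labels((goal, mode) for goal, _, mode in DEMOS)

# Refined supervision: the goal is kept and the subtask is added.
refined = aliased_labels(
    (goal_plus_subtask(goal, sub), mode) for goal, sub, mode in DEMOS
)

print(coarse)   # the coarse goal maps to several action modes
print(refined)  # no label is ambiguous once subtasks are attached
```

In this toy data, "tidy the table" is aliased across three action modes under coarse labels, while the refined labels leave nothing ambiguous.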
Method
The S2 framework refines trajectories into subtask-level language ("Specify More") and imposes an explicit visual evidence budget during executor training ("See Less") to improve VLA generalization.
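The summary does not say how the visual evidence budget is implemented, so the following is only a minimal sketch of one plausible reading: keep a fixed number of image patches and mask the rest. The function name `apply_evidence_budget`, the per-patch relevance scores, and the top-k rule are assumptions made for illustration, not the authors' method. Since S2 requires no region or mask annotations, such scores would have to be produced by the model during training rather than supplied by hand as they are here.

```python
def apply_evidence_budget(patch_scores, budget):
    """Return a 0/1 keep-mask that retains the `budget` highest-scoring patches.

    Assumption: each visual patch has a scalar relevance score, and the
    budget is expressed as a maximum number of patches the executor may use.
    """
    if not 0 <= budget <= len(patch_scores):
        raise ValueError("budget must be between 0 and the number of patches")
    # Rank patch indices from most to least relevant.
    ranked = sorted(
        range(len(patch_scores)), key=lambda i: patch_scores[i], reverse=True
    )
    keep = set(ranked[:budget])
    return [1 if i in keep else 0 for i in range(len(patch_scores))]


# Hypothetical scores for six patches; only two may be kept.
scores = [0.05, 0.90, 0.10, 0.70, 0.02, 0.40]
mask = apply_evidence_budget(scores, budget=2)
print(mask)
```

A hard top-k mask is the simplest way to make a budget explicit; a trainable version would more likely use a soft or differentiable relaxation, which this sketch does not attempt.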
In practice
- Refine VLA trajectories into subtask-level language.
- Apply explicit visual evidence budgets.
- Integrate with VLM planners via in-context learning.
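As a rough illustration of the last step, a VLM planner can be steered toward subtask-level output purely through few-shot examples in its prompt. The sketch below assumes a text prompt with hypothetical `Goal` / `Scene` / `Next subtask` fields; the prompt wording, the field names, and `build_planner_prompt` are invented for this example, and the call to an actual VLM is left out.

```python
def build_planner_prompt(examples, goal, scene):
    """Assemble a few-shot prompt asking a planner for the next subtask.

    Assumption: each example is a dict with 'goal', 'scene', and 'subtask'
    keys. The final entry is left open for the planner to complete.
    """
    lines = ["You are a planner. Given a goal and a scene, name the next subtask."]
    for ex in examples:
        lines += [
            "",
            f"Goal: {ex['goal']}",
            f"Scene: {ex['scene']}",
            f"Next subtask: {ex['subtask']}",
        ]
    lines += ["", f"Goal: {goal}", f"Scene: {scene}", "Next subtask:"]
    return "\n".join(lines)


# Hypothetical in-context examples.
EXAMPLES = [
    {
        "goal": "tidy the table",
        "scene": "a cup and a cloth on the table",
        "subtask": "pick up the cup",
    },
    {
        "goal": "open the drawer",
        "scene": "a closed drawer with a handle",
        "subtask": "pull the drawer handle",
    },
]

prompt = build_planner_prompt(
    EXAMPLES, goal="tidy the table", scene="a cloth on an empty table"
)
print(prompt)
```

The planner's completion would then be combined with the original goal and passed to the executor, so the executor sees the same kind of subtask-level language it was trained on.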
Topics
- Vision-Language-Action Models
- Robotics Generalization
- Visual Evidence Budgeting
- Executor Training
- In-Context Learning
- AgiBot G2
Best for: Research Scientist, AI Scientist, Robotics Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.