Retrieve, Don't Retrain: Extending Vision Language Action Models to New Tasks at Test Time
Summary
A new retrieval-augmented policy extends Vision-Language-Action (VLA) models to novel tasks at test time, without per-task fine-tuning or the extensive data collection it requires. The policy is trained once on paired demonstrations from a target embodiment (the query side) and a cheaper embodiment (the pool side, e.g., human-hand video), then frozen. A new task is added by appending pool-side demonstrations to a retrieval pool; the frozen policy then conditions on retrieved trajectories at each control step. Parameter updates are therefore needed only when an entirely new embodiment is introduced, not for each new task. The method improves several VLA policies and is particularly effective with Cosmos Policy, a video-generation-based world-action model, where retrieval provides high-level motion priors. On PushT it improved cross-embodiment generalization to unseen goal angles, on RoboTwin 2.0 it outperformed baselines on unseen tasks, and it also proved viable on a real robot.
Key takeaway
For robotics engineers developing Vision-Language-Action policies, this retrieval-augmented approach changes how a model is adapted to new tasks. You extend a policy's capabilities by adding demonstrations to a retrieval pool, rather than incurring the data collection and compute costs of per-task fine-tuning. This shortens the path from a new task to a deployed VLA model; parameter updates are still required when the embodiment itself is new.
Key insights
Adapting Vision-Language-Action models to new tasks can be achieved via demonstration retrieval, not costly per-task fine-tuning.
Principles
- Retrieval can replace parameter updates for task adaptation.
- Paired demonstrations enable cross-embodiment generalization.
- Future-image objectives enhance retrieval-conditioned actions.
Method
Train a retrieval-augmented policy once on paired target/pool demonstrations, then freeze it. Add new tasks by appending pool-side demonstrations to a retrieval pool, allowing the policy to condition on retrieved trajectories.
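The "append, then retrieve" step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `RetrievalPool` class, the use of fixed embedding vectors as keys, and cosine similarity as the ranking score are all assumptions made for the example, since this summary does not state how demonstrations are indexed or compared.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


class RetrievalPool:
    """Hypothetical store of pool-side demonstrations (e.g. human-hand video)."""

    def __init__(self):
        self.entries = []  # list of (key embedding, trajectory) pairs

    def append(self, key, trajectory):
        """Add a demonstration. No policy parameters are touched."""
        self.entries.append((list(key), trajectory))

    def retrieve(self, query, k=1):
        """Return the k trajectories whose keys are most similar to the query."""
        ranked = sorted(self.entries, key=lambda e: cosine(e[0], query), reverse=True)
        return [trajectory for _, trajectory in ranked[:k]]


pool = RetrievalPool()
# Demonstration for a task seen during training (assumed toy embedding).
pool.append([1.0, 0.0, 0.0], [(0.0, 0.0), (1.0, 0.0)])
# A new task is integrated by appending one more pool-side demonstration.
pool.append([0.0, 1.0, 0.0], [(0.0, 0.0), (0.0, 1.0)])

# At test time the frozen policy would receive the retrieved trajectory as context.
nearest = pool.retrieve([0.1, 0.9, 0.0], k=1)
```

The point of the design is that `append` is the only operation needed to support a new task; the policy that consumes `nearest` stays frozen.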
In practice
- Use human-hand video as a cheaper demonstration pool.
- Apply retrieval to extend VLA policies to unseen goal angles.
- Integrate into world-action models (WAMs) for enhanced visual consistency.
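A toy control loop can illustrate how a frozen policy might reach an unseen goal purely through what it retrieves. Everything here is a stand-in chosen for the example: `frozen_policy` is a fixed clipped-step rule over 2-D points rather than a VLA, and `retrieve` matches on the nearest trajectory start, which is an assumed heuristic and not the paper's retrieval method.

```python
import math


def frozen_policy(obs, waypoint, max_step=0.1):
    """Stand-in for a frozen policy: take a bounded step toward the retrieved waypoint."""
    return tuple(max(-max_step, min(max_step, w - o)) for o, w in zip(obs, waypoint))


def retrieve(obs, pool):
    """Toy retrieval: pick the trajectory whose start is nearest, return its final waypoint."""
    nearest = min(pool, key=lambda trajectory: math.dist(trajectory[0], obs))
    return nearest[-1]


def rollout(start, pool, steps=50):
    """Query the pool at every control step, then act with the unchanged policy."""
    obs = tuple(start)
    for _ in range(steps):
        target = retrieve(obs, pool)
        action = frozen_policy(obs, target)
        obs = tuple(o + a for o, a in zip(obs, action))
    return obs


# Pool containing one demonstration: move from (0, 0) to (1, 0).
pool = [[(0.0, 0.0), (0.5, 0.0), (1.0, 0.0)]]
seen_goal = rollout((0.0, 0.2), pool)

# New goal: append a demonstration, leave frozen_policy untouched.
pool.append([(0.0, 5.0), (0.5, 5.0), (1.0, 5.0)])
new_goal = rollout((0.0, 4.8), pool)
```

In this sketch the second rollout ends near a goal the policy code never encoded, because the only thing that changed between the two runs is the contents of `pool`.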
Topics
- Vision-Language-Action Models
- Retrieval-Augmented Policies
- Robotics
- Cross-Embodiment Generalization
- Task Adaptation
- Cosmos Policy
Best for: Research Scientist, AI Scientist, Robotics Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.