VISTA: Vision-Grounded and Physics-Validated Adaptation of UMI data for VLA Training

· Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

The VISTA framework addresses two mismatches that arise when training large-scale Vision-Language-Action (VLA) models on Universal Manipulation Interface (UMI) data. First, wrist-mounted fisheye camera views are out-of-distribution for pretrained Vision-Language Models (VLMs) because of severe radial distortion and a local, gripper-centric perspective. Second, human-collected trajectories frequently violate physical limits, which leads to infeasible actions. VISTA integrates three components: UMI-VQA, a large-scale VQA dataset built specifically for fisheye observations, which aligns VLM representations; a systematic physical-validation pipeline that pre-checks data completeness and scores trajectories for continuity, self-collision risk, and execution fidelity; and a two-stage co-training recipe that jointly learns vision-language grounding and action prediction. In experiments, VISTA consistently improves policy performance, outperforming baselines such as $\pi_{0.5}$, LingBot-VLA, and Wall-X on various manipulation tasks, and its physical-validation scores accurately predict deployment success.

Key takeaway

Robotics engineers training Vision-Language-Action models on UMI data must account for both the visual distortion of fisheye cameras and the physical infeasibility of some human-collected trajectories. Implement a physical-validation pipeline to filter out problematic data so that the training set leads to deployable policies. Also consider auxiliary vision-language supervision with a dataset like UMI-VQA to align VLM representations, which the work reports improves downstream policy performance and deployment success.
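As a rough illustration of the filtering step, the sketch below combines a completeness pre-check with a score threshold. It is not VISTA's implementation: the field names in `REQUIRED_KEYS`, the `score_fn` callback, and the `0.9` threshold are all hypothetical placeholders.

```python
# Hypothetical per-episode streams; real UMI datasets may use other names.
REQUIRED_KEYS = ("images", "poses", "gripper")


def is_complete(episode):
    """Pre-check: every required stream exists, is non-empty, and all
    streams have the same number of samples."""
    if any(k not in episode or len(episode[k]) == 0 for k in REQUIRED_KEYS):
        return False
    return len({len(episode[k]) for k in REQUIRED_KEYS}) == 1


def filter_episodes(episodes, score_fn, threshold=0.9):
    """Split episodes into (kept, dropped).

    An episode is kept only if it passes the completeness pre-check and
    score_fn(episode) is at least `threshold` (an assumed cutoff).
    """
    kept, dropped = [], []
    for ep in episodes:
        if is_complete(ep) and score_fn(ep) >= threshold:
            kept.append(ep)
        else:
            dropped.append(ep)
    return kept, dropped
```

Here `score_fn` stands in for whatever trajectory score the pipeline produces; keeping the dropped episodes makes it easy to inspect why data was rejected.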

Key insights

VISTA bridges visual and physical data mismatches in UMI for robust VLA model training.

Principles

Method

VISTA employs auxiliary VQA supervision for visual alignment, systematic physical validation of trajectories for continuity and collision risk, and a two-stage co-training recipe.
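To make the idea of scoring a trajectory for continuity concrete, here is a minimal sketch that uses finite differences on end-effector positions. The scoring rule, the function name `continuity_score`, and the `max_speed` / `max_accel` limits are assumptions chosen for illustration; the source does not specify how VISTA computes its scores.

```python
from math import sqrt


def _norm(vec):
    """Euclidean norm of a sequence of floats."""
    return sqrt(sum(x * x for x in vec))


def continuity_score(positions, dt, max_speed, max_accel):
    """Fraction of acceleration steps whose neighbouring speeds and
    acceleration stay within the given limits (1.0 = no violations).

    positions: list of (x, y, z) samples taken at a fixed interval dt.
    """
    if len(positions) < 3:
        raise ValueError("need at least 3 samples")
    # Finite-difference velocities between consecutive samples.
    vels = [[(b - a) / dt for a, b in zip(p0, p1)]
            for p0, p1 in zip(positions, positions[1:])]
    # Finite-difference accelerations between consecutive velocities.
    accs = [[(b - a) / dt for a, b in zip(v0, v1)]
            for v0, v1 in zip(vels, vels[1:])]
    ok = 0
    for i, acc in enumerate(accs):
        within_speed = (_norm(vels[i]) <= max_speed
                        and _norm(vels[i + 1]) <= max_speed)
        if within_speed and _norm(acc) <= max_accel:
            ok += 1
    return ok / len(accs)
```

A smooth demonstration scores near 1.0, while a trajectory with a tracking jump scores lower, so the value can feed a filtering threshold. Self-collision risk and execution fidelity would need a robot model and replay data, which this sketch does not attempt.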

Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Robotics Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.