Towards Generalizable Robotic Manipulation in Dynamic Environments

2026-03-16 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

DOMINO is a new dataset and benchmark that addresses the limitations of Vision-Language-Action (VLA) models in dynamic robotic manipulation environments. Existing VLAs typically rely on single-frame observations and struggle with moving targets, both because relevant datasets are scarce and because the models lack spatiotemporal reasoning. DOMINO features 35 tasks of varying complexity, over 110,000 expert trajectories, and a multi-dimensional evaluation suite. The researchers use DOMINO to evaluate current VLAs, explore training strategies for dynamic awareness, and validate the generalizability of dynamic data. They also propose PUMA, a dynamics-aware VLA architecture that integrates scene-centric historical optical flow and specialized world queries to forecast object-centric future states. PUMA achieves a 6.3% absolute improvement in success rate over baseline models, demonstrating state-of-the-art performance.

Key takeaway

For research scientists developing robotic manipulation systems, DOMINO and PUMA target a gap that single-frame VLAs leave open: manipulating moving targets. Consider adding DOMINO to your VLA training and benchmarking, since training on its dynamic data fosters robust spatiotemporal representations that transfer to both dynamic and static tasks, which may improve overall system reliability and performance.

Key insights

Dynamic manipulation in robotics requires specialized datasets and architectures for effective spatiotemporal reasoning.

Principles

Dynamic data improves VLA generalizability.
Historical optical flow enhances dynamic awareness.

Method

PUMA integrates scene-centric historical optical flow and specialized world queries to implicitly forecast object-centric future states, coupling history-aware perception with short-horizon prediction.
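
The idea of coupling history-aware perception with short-horizon prediction can be illustrated with a toy example. The sketch below is not PUMA's architecture, which forecasts future states implicitly through learned world queries; it swaps in an explicit constant-velocity extrapolation over a short buffer of past object positions. The class name `MotionHistory` and its methods are hypothetical and exist only for illustration.

```python
from collections import deque
from typing import Deque, List, Tuple

Point = Tuple[float, float]


class MotionHistory:
    """Fixed-length buffer of past 2D object positions (hypothetical helper)."""

    def __init__(self, maxlen: int = 4) -> None:
        if maxlen < 2:
            raise ValueError("need at least two observations to estimate motion")
        # Old observations fall off automatically once the buffer is full.
        self.positions: Deque[Point] = deque(maxlen=maxlen)

    def push(self, position: Point) -> None:
        """Record the newest observed position."""
        self.positions.append(position)

    def mean_velocity(self) -> Point:
        """Average per-step displacement across the buffered history."""
        pts = list(self.positions)
        if len(pts) < 2:
            return (0.0, 0.0)
        steps = len(pts) - 1
        return ((pts[-1][0] - pts[0][0]) / steps,
                (pts[-1][1] - pts[0][1]) / steps)

    def forecast(self, horizon: int) -> List[Point]:
        """Extrapolate `horizon` future positions at constant velocity."""
        if not self.positions:
            return []
        vx, vy = self.mean_velocity()
        x, y = self.positions[-1]
        return [(x + vx * k, y + vy * k) for k in range(1, horizon + 1)]


history = MotionHistory(maxlen=4)
for observed in [(0.0, 0.0), (1.0, 0.5), (2.0, 1.0)]:
    history.push(observed)

future = history.forecast(horizon=2)
```

A single-frame policy sees only the last position, `(2.0, 1.0)`, and would aim where the object was. With even a short history, the target can be placed where the object is expected to be a few steps ahead.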

In practice

Use the DOMINO dataset to train and benchmark VLAs on dynamic tasks.
Add optical flow over past frames as a motion cue for dynamic scene understanding.
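
As a rough illustration of what an optical-flow history contains, the sketch below estimates per-frame displacement of one image block by exhaustive block matching (sum of absolute differences) and collects the displacements across consecutive frames. This is a deliberately simple classical technique, not the flow estimator used by PUMA, and it tracks a single block rather than a dense scene-wide field. The helpers `make_frame`, `block_flow`, and `flow_history` are hypothetical names, and the frames are synthetic.

```python
from typing import List, Tuple

Frame = List[List[float]]


def make_frame(height: int, width: int, top: int, left: int, size: int) -> Frame:
    """Synthetic frame: zeros with one bright square at (top, left)."""
    frame = [[0.0] * width for _ in range(height)]
    for i in range(size):
        for j in range(size):
            frame[top + i][left + j] = 1.0
    return frame


def block_flow(prev: Frame, curr: Frame, top: int, left: int,
               size: int, radius: int) -> Tuple[int, int]:
    """Return the (dy, dx) shift that best matches one block between frames."""
    height, width = len(prev), len(prev[0])
    best, best_cost = (0, 0), float("inf")
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y0, x0 = top + dy, left + dx
            # Skip candidate positions that fall outside the image.
            if y0 < 0 or x0 < 0 or y0 + size > height or x0 + size > width:
                continue
            cost = sum(abs(prev[top + i][left + j] - curr[y0 + i][x0 + j])
                       for i in range(size) for j in range(size))
            if cost < best_cost:
                best, best_cost = (dy, dx), cost
    return best


def flow_history(frames: List[Frame], top: int, left: int,
                 size: int, radius: int) -> List[Tuple[int, int]]:
    """Displacement of one block across each pair of consecutive frames."""
    flows = []
    for prev, curr in zip(frames, frames[1:]):
        dy, dx = block_flow(prev, curr, top, left, size, radius)
        flows.append((dy, dx))
        top, left = top + dy, left + dx  # follow the block into the next frame
    return flows


# A 2x2 square moving across an 8x8 image over three frames.
frames = [make_frame(8, 8, 2, 2, 2),
          make_frame(8, 8, 3, 4, 2),
          make_frame(8, 8, 4, 5, 2)]
flows = flow_history(frames, top=2, left=2, size=2, radius=2)
```

The resulting list of `(dy, dx)` pairs is the kind of motion signal a single frame cannot provide. In a real pipeline a dense flow estimator would replace the block matcher, and the flow maps would be passed to the model alongside the current image.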

Topics

Robotic Manipulation
Vision-Language-Action Models
Dynamic Environments
DOMINO Dataset
PUMA Architecture

Code references

H-EmbodVis/DOMINO

Best for: Research Scientist, AI Researcher, AI Scientist, Robotics Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.