HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System

Source: Artificial Intelligence · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Computer Vision & Pattern Recognition) · Depth: Expert, quick

Summary

HiVLA is a hierarchical, visual-grounding-centric framework for robotic manipulation. It addresses the trade-off between preserving a vision-language model's (VLM's) reasoning and fine-tuning it for control by decoupling high-level semantic planning from low-level motor control. At the high level, a VLM planner handles task decomposition and visual grounding, producing a structured plan that pairs each subtask instruction with a target bounding box. At the low level, a flow-matching Diffusion Transformer (DiT) action expert executes each subtask using a cascaded cross-attention mechanism that integrates three inputs: global context, high-resolution object-centric crops, and skill semantics. Because planning is handled upstream, the DiT can focus solely on robust action execution. The decoupled design keeps the VLM's zero-shot reasoning intact and lets the planner and the action expert be improved independently. In both simulated and real-world experiments, HiVLA significantly outperforms existing end-to-end baselines, particularly in long-horizon skill composition and in precise manipulation of small objects in cluttered scenes.
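To make the planner's output concrete, here is a minimal sketch of what a structured plan and the resulting object-centric crops might look like. The `Subtask` fields, the `crop_target` helper, the padding value, and the example instructions are all hypothetical; the summary only states that each subtask carries an instruction and a target bounding box, not how they are encoded.

```python
from dataclasses import dataclass
from typing import List, Tuple

import numpy as np


@dataclass
class Subtask:
    """One step of a hypothetical structured plan (field names are assumptions)."""
    instruction: str                  # natural-language subtask instruction
    bbox: Tuple[int, int, int, int]   # (x_min, y_min, x_max, y_max) in pixels


def crop_target(image: np.ndarray, bbox, pad: int = 8) -> np.ndarray:
    """Cut a padded crop around the bounding box, clamped to the image bounds."""
    h, w = image.shape[:2]
    x0, y0, x1, y1 = bbox
    x0, y0 = max(0, x0 - pad), max(0, y0 - pad)
    x1, y1 = min(w, x1 + pad), min(h, y1 + pad)
    if x1 <= x0 or y1 <= y0:
        raise ValueError(f"empty crop for bbox {bbox}")
    return image[y0:y1, x0:x1]


# Illustrative plan for a made-up task; a real planner would emit this from a VLM.
plan: List[Subtask] = [
    Subtask("pick up the red block", (100, 120, 140, 160)),
    Subtask("place it in the bowl", (600, 400, 700, 470)),
]

frame = np.zeros((480, 640, 3), dtype=np.uint8)  # stand-in for a camera frame
crops = [crop_target(frame, s.bbox) for s in plan]
```

Cropping from the full-resolution frame, rather than from a downsampled copy, is one plausible way to keep small objects legible to the action expert; the second box above also shows the crop being clamped where it runs past the image edge.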

Key takeaway

For research scientists developing robotic manipulation systems, HiVLA makes the case for a decoupled hierarchical architecture. Consider this approach when you need to preserve VLM reasoning while improving low-level control, especially for long-horizon tasks or fine-grained object manipulation in cluttered environments. Because the planner and the action expert are separate, each can be upgraded without retraining the other.

Key insights

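Decoupling high-level VLM planning from low-level motor control preserves the VLM's zero-shot reasoning and improves manipulation performance, most visibly in long-horizon skill composition and in handling small objects in clutter.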

Principles

Method

A VLM planner generates subtask instructions and bounding boxes, which a Diffusion Transformer with cascaded cross-attention then translates into physical actions.
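The cascaded cross-attention idea can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not HiVLA's implementation: it uses a single attention head, random weights, and no normalization, self-attention, or timestep conditioning, and the order of the three conditioning streams (global context, then object crop, then skill semantics) is a guess. All names and token counts are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head cross-attention: query tokens attend over context tokens."""
    q = queries @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


def cascaded_block(action_tokens, streams, params):
    """Apply one residual cross-attention per conditioning stream, in sequence."""
    x = action_tokens
    for context, (w_q, w_k, w_v) in zip(streams, params):
        x = x + cross_attention(x, context, w_q, w_k, w_v)
    return x


rng = np.random.default_rng(0)
d = 32                                      # token width (assumed)
action_tokens = rng.normal(size=(16, d))    # noisy action chunk, 16 steps (assumed)
global_ctx = rng.normal(size=(196, d))      # tokens from the full scene image
crop_ctx = rng.normal(size=(64, d))         # tokens from the object-centric crop
skill_ctx = rng.normal(size=(8, d))         # tokens from the subtask instruction

streams = [global_ctx, crop_ctx, skill_ctx]  # cascade order is an assumption
params = [
    tuple(rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))
    for _ in streams
]
out = cascaded_block(action_tokens, streams, params)
```

Attending to the streams one after another, instead of concatenating them into a single context, gives each stream its own projection weights, so the action tokens are refined by one conditioning source at a time.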

Best for: Research Scientist, AI Scientist, Robotics Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.