HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System
Summary
HiVLA is a visual-grounded-centric hierarchical framework for robotic manipulation that addresses the trade-off between preserving a VLM's reasoning ability and fine-tuning it for control. It decouples high-level semantic planning from low-level motor control. The high-level component is a VLM planner that handles task decomposition and visual grounding, producing structured plans made up of subtask instructions and target bounding boxes. For low-level execution, HiVLA uses a flow-matching Diffusion Transformer (DiT) action expert with a cascaded cross-attention mechanism. This mechanism integrates global context, high-resolution object-centric crops, and skill semantics, so the DiT can focus solely on robust action execution. The decoupled architecture preserves the VLM's zero-shot reasoning capabilities and allows each part to be improved independently. In experiments in both simulated and real-world environments, HiVLA is reported to significantly outperform existing end-to-end baselines, particularly in long-horizon skill composition and precise manipulation of small objects in cluttered settings.
Key takeaway
For research scientists developing robotic manipulation systems, HiVLA offers a compelling architectural shift. You should consider adopting a decoupled hierarchical approach to preserve VLM reasoning while enhancing low-level control, especially when tackling long-horizon tasks or fine-grained object manipulation in complex environments. This framework suggests a path to more robust and adaptable robotic agents.
Key insights
Decoupling high-level VLM planning from low-level motor control improves robotic manipulation performance.
Principles
- Preserve VLM zero-shot reasoning.
- Decouple planning from execution.
- Fuse multi-scale visual context.
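To make the "fuse multi-scale visual context" principle concrete, here is a minimal sketch of what a cascaded cross-attention step might look like. It is an illustration under stated assumptions, not HiVLA's implementation: the stage order (global context, then object crop, then skill semantics), the single attention head, the absence of learned projections, and every name in the code are hypothetical.

```python
import math
from typing import List

Vector = List[float]
Tokens = List[Vector]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries: Tokens, context: Tokens) -> Tokens:
    """Single-head cross-attention with a residual connection.

    Assumption: no learned query/key/value projections, to keep the sketch short.
    """
    dim = len(queries[0])
    output = []
    for q in queries:
        # Scaled dot-product score between this query and every context token.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in context]
        weights = softmax(scores)
        # Weighted average of the context tokens.
        attended = [sum(w * k[d] for w, k in zip(weights, context)) for d in range(dim)]
        # Residual: add the attended context back onto the query.
        output.append([qi + ai for qi, ai in zip(q, attended)])
    return output


def cascaded_cross_attention(
    action_tokens: Tokens,
    global_tokens: Tokens,
    crop_tokens: Tokens,
    skill_tokens: Tokens,
) -> Tokens:
    """Attend to each context source in sequence (the stage order is an assumption)."""
    x = cross_attend(action_tokens, global_tokens)  # coarse scene context
    x = cross_attend(x, crop_tokens)                # high-resolution object crop
    x = cross_attend(x, skill_tokens)               # subtask / skill semantics
    return x
```

The point of the cascade is that each context source gets its own attention stage, so the action tokens are refined step by step rather than attending to one concatenated sequence.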
Method
A VLM planner generates subtask instructions and bounding boxes, which a Diffusion Transformer with cascaded cross-attention then translates into physical actions.
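The planner-to-executor handoff described above can be sketched as a small data structure plus a cropping helper. This is a hedged illustration: the `Subtask` class, the `(x_min, y_min, x_max, y_max)` pixel box format, and `crop_to_bbox` are assumptions made for the example, and the summary does not specify the plan schema HiVLA uses.

```python
from dataclasses import dataclass
from typing import List, Tuple

BBox = Tuple[int, int, int, int]  # assumed format: (x_min, y_min, x_max, y_max) in pixels


@dataclass
class Subtask:
    """One step of a structured plan (hypothetical schema)."""
    instruction: str  # natural-language subtask instruction
    bbox: BBox        # bounding box of the target object


def crop_to_bbox(image: List[list], bbox: BBox) -> List[list]:
    """Return the object-centric crop for a bounding box.

    `image` is a row-major list of pixel rows; the box is clamped to the image bounds.
    """
    x_min, y_min, x_max, y_max = bbox
    height, width = len(image), len(image[0])
    x_min, x_max = max(0, x_min), min(width, x_max)
    y_min, y_max = max(0, y_min), min(height, y_max)
    return [row[x_min:x_max] for row in image[y_min:y_max]]


# A toy plan such as a planner might emit (contents are made up for illustration).
plan = [
    Subtask("pick up the red block", (2, 1, 5, 4)),
    Subtask("place it in the bowl", (6, 5, 10, 8)),
]

# Stand-in image: 8 pixels wide, 6 pixels tall, each pixel stores its (x, y) position.
image = [[(x, y) for x in range(8)] for y in range(6)]

# One crop per subtask, which a low-level policy could consume alongside the full frame.
crops = [crop_to_bbox(image, subtask.bbox) for subtask in plan]
```

Keeping the plan as plain structured data is what lets the planner and the action expert be swapped or improved independently: the executor only depends on the instruction text and the box.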
In practice
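- Apply a hierarchical VLA to long-horizon tasks that compose multiple skills.
- Use a flow-matching DiT as the low-level action expert.
- Feed high-resolution object-centric crops to the action expert to improve small-object manipulation in clutter.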
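For readers unfamiliar with flow matching, the sketch below shows the generic sampling loop such an action expert typically relies on: start from noise and integrate a velocity field from t = 0 to t = 1 with Euler steps. It is a toy under stated assumptions, not HiVLA's sampler. The `toy_velocity` field is a closed-form stand-in for a trained network, and the step count and all names are hypothetical.

```python
from typing import Callable, List

Vector = List[float]
VelocityField = Callable[[Vector, float], Vector]


def sample_action(velocity: VelocityField, noise: Vector, num_steps: int = 10) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(noise)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toy_velocity(target: Vector) -> VelocityField:
    """Stand-in for a learned network: points from x toward a fixed target.

    For the straight-line path x_t = (1 - t) * x_0 + t * x_1, the velocity at x_t
    is (x_1 - x_t) / (1 - t); a real model would predict this from observations.
    """
    def velocity(x: Vector, t: float) -> Vector:
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


# Example: transport a "noise" vector to a 3-dimensional target action.
target_action = [0.2, -0.5, 1.0]
action = sample_action(toy_velocity(target_action), noise=[1.3, 0.4, -0.7])
```

In a real system the velocity would come from the DiT conditioned on the visual and language context, and the number of integration steps trades off latency against accuracy.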
Topics
- HiVLA
- Hierarchical Robotic Manipulation
- Vision-Language Models
- Diffusion Transformer
- Visual Grounding
Best for: Research Scientist, AI Scientist, Robotics Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.