ActionMap: Robot Policy Learning via Voxel Action Heatmap

Source: cs.CV updates on arXiv.org · Field: Technology & Digital, Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

ActionMap is a voxel heatmap action head for Vision-Language-Action (VLA) models that replaces the conventional single-point action decoder. Where existing methods treat the action space as unstructured, ActionMap predicts a voxel heatmap whose cells directly store action probabilities. As a drop-in replacement, it improves results across several benchmarks. In LIBERO simulation, it raises success rates by +8.2% over OpenVLA-OFT's L1 regression head and by +1.6% over pi0.5's flow-matching head. It is also more data-efficient, holding a +26.0% lead over OpenVLA-OFT when trained on 10% of the data, and it converges faster. On real-world Franka manipulation, ActionMap outperforms OpenVLA-OFT, winning 20/30 trials versus 7/30 and reducing end-effector grasp position error by a factor of 2 to 3. Because the gains hold across distinct VLA backbones, the results point to action representation as a critical factor in performance.

Key takeaway

Robotics and ML engineers building VLA models who struggle with action precision, data efficiency, or slow training convergence should consider integrating ActionMap's voxel heatmap action head. In the reported experiments, it improves manipulation success rates and end-effector positioning accuracy. Because it is a drop-in replacement, an existing VLA backbone can adopt it directly, which may shorten development and reduce the data needed for robust real-world performance.

Key insights

Spatially structured voxel heatmaps for robot actions significantly improve VLA model performance and data efficiency.

Principles

Method

ActionMap replaces a VLA's native action decoder, predicting three voxel heatmaps (translation, rotation, gripper). It trains against softmax-normalized Gaussian blobs via cross-entropy and decodes using top-k soft argmax.
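
The training target and decoding step described above can be illustrated with a minimal NumPy sketch for a single 3D translation heatmap. This is not the authors' implementation: the grid size, `sigma`, `k`, and all function names are assumptions chosen for illustration, and the rotation and gripper heatmaps are omitted. It shows a softmax-normalized Gaussian blob as the target, a cross-entropy loss against it, and a top-k soft argmax that recovers a sub-voxel position.

```python
import numpy as np

def gaussian_target(center, grid_size, sigma=1.0):
    """Softmax-normalized Gaussian blob over a cubic voxel grid (sums to 1)."""
    axes = [np.arange(grid_size, dtype=np.float64)] * 3
    coords = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1)  # (G, G, G, 3)
    sq_dist = np.sum((coords - np.asarray(center, dtype=np.float64)) ** 2, axis=-1)
    logits = -sq_dist / (2.0 * sigma ** 2)
    logits -= logits.max()  # subtract the max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def log_softmax(logits):
    """Log-softmax taken over all voxels jointly."""
    flat = logits.reshape(-1)
    flat = flat - flat.max()
    return (flat - np.log(np.exp(flat).sum())).reshape(logits.shape)

def cross_entropy(pred_logits, target_probs):
    """Cross-entropy between a soft target and the predicted voxel distribution."""
    return float(-(target_probs * log_softmax(pred_logits)).sum())

def topk_soft_argmax(pred_logits, k=8):
    """Softmax-weighted mean of the coordinates of the k highest-scoring voxels."""
    flat = pred_logits.reshape(-1)
    top_idx = np.argpartition(flat, -k)[-k:]       # indices of the k largest logits
    top_logits = flat[top_idx] - flat[top_idx].max()
    weights = np.exp(top_logits)
    weights /= weights.sum()
    coords = np.stack(np.unravel_index(top_idx, pred_logits.shape), axis=-1)  # (k, 3)
    return weights @ coords.astype(np.float64)

# Hypothetical example: a ground-truth action at a sub-voxel location.
true_center = np.array([4.3, 7.0, 2.6])
target = gaussian_target(true_center, grid_size=16, sigma=1.0)

# A "perfect" prediction reproduces the target; a uniform one knows nothing.
perfect_logits = np.log(target + 1e-12)
uniform_logits = np.zeros_like(target)

loss_perfect = cross_entropy(perfect_logits, target)
loss_uniform = cross_entropy(uniform_logits, target)
decoded = topk_soft_argmax(perfect_logits, k=27)
```

In this sketch, restricting the soft argmax to the top-k voxels keeps far-away low-probability cells from pulling the estimate, while still allowing a decoded position that falls between voxel centers, which a hard argmax cannot produce.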

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Robotics Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.