ActionMap: Robot Policy Learning via Voxel Action Heatmap

2026-06-08 · Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

ActionMap is a voxel heatmap action head for Vision-Language-Action (VLA) models that replaces conventional single-point action decoders. Where existing methods treat the action space as unstructured, ActionMap predicts a voxel heatmap that stores action probabilities directly. As a drop-in replacement, it improves performance across the reported benchmarks. In LIBERO simulation, it raises success rates by +8.2% over OpenVLA-OFT's L1 regression head and +1.6% over pi0.5's flow-matching head. It is also more data-efficient, holding a +26.0% lead over OpenVLA-OFT at 10% training data, and it converges faster. On real-world Franka manipulation, ActionMap outperforms OpenVLA-OFT, winning 20/30 trials versus 7/30 and reducing end-effector grasp position error by a factor of 2 to 3. That these gains hold across distinct VLA backbones points to action representation as a critical factor in performance.

Key takeaway

If you are a robotics or ML engineer building VLA models and struggling with action precision, data efficiency, or slow training convergence, consider integrating ActionMap's voxel heatmap action head. In the reported experiments it improved manipulation success rates and end-effector positioning accuracy. Because it is a drop-in replacement, an existing VLA backbone can adopt it directly, which may shorten development and reduce the data needed for robust real-world performance.

Key insights

Spatially structured voxel heatmaps for robot actions significantly improve VLA model performance and data efficiency.

Principles

The action space has rich geometric structure.
Soft distributional outputs naturally capture that spatial structure.
Heatmap-based classification gives spatial prediction a clean optimization objective.

Method

ActionMap replaces a VLA's native action decoder with a head that predicts three voxel heatmaps (translation, rotation, gripper). The head is trained with cross-entropy against softmax-normalized Gaussian blobs and decoded with top-k soft argmax.
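
To make the training target concrete, here is a minimal NumPy sketch of a softmax-normalized Gaussian blob on a single voxel grid and the cross-entropy against predicted logits. It is an illustration under stated assumptions, not the code in showlab/ActionMap: the grid size, the value of sigma, the cubic grid shape, and the function names gaussian_blob_target and heatmap_cross_entropy are all hypothetical, and only one of the three heatmaps is shown.

```python
import numpy as np

def gaussian_blob_target(center, grid_size=16, sigma=1.0):
    """Soft label over a cubic voxel grid, peaked at `center` (voxel units).

    grid_size and sigma are illustrative values, not taken from the paper.
    """
    axes = np.arange(grid_size, dtype=np.float64)
    zz, yy, xx = np.meshgrid(axes, axes, axes, indexing="ij")
    coords = np.stack([zz, yy, xx], axis=-1)              # (G, G, G, 3)
    offset = coords - np.asarray(center, dtype=np.float64)
    blob_logits = -np.sum(offset ** 2, axis=-1) / (2.0 * sigma ** 2)
    # Softmax over all voxels, shifted by the max for numerical stability.
    weights = np.exp(blob_logits - blob_logits.max())
    return weights / weights.sum()

def log_softmax(logits):
    """Log-softmax over every voxel of the grid."""
    flat = logits.reshape(-1)
    shifted = flat - flat.max()
    return (shifted - np.log(np.exp(shifted).sum())).reshape(logits.shape)

def heatmap_cross_entropy(pred_logits, target):
    """Cross-entropy between a soft target and predicted voxel logits."""
    return float(-(target * log_softmax(pred_logits)).sum())

# A ground-truth action that falls between voxel centers.
target = gaussian_blob_target(center=(4.2, 9.3, 2.0))

rng = np.random.default_rng(0)
random_logits = rng.normal(size=target.shape)   # stand-in for an untrained head
perfect_logits = np.log(target)                 # a head that matches the target

loss_random = heatmap_cross_entropy(random_logits, target)
loss_perfect = heatmap_cross_entropy(perfect_logits, target)
```

Because the target is a distribution rather than a one-hot voxel, the loss is minimized when the predicted distribution matches the blob, so nearby voxels receive partial credit instead of being treated as equally wrong.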

In practice

Integrate voxel heatmap heads into existing VLAs.
Utilize Gaussian-blob targets for soft-label supervision.
Employ top-k soft argmax for continuous action decoding.
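
The decoding step in the last item can be sketched as follows. This is a hedged NumPy illustration of a top-k soft argmax, assuming a cubic grid, k = 8, and a hypothetical voxel_to_metric helper with made-up workspace bounds; none of these values or names come from the source.

```python
import numpy as np

def topk_soft_argmax(logits, k=8):
    """Continuous voxel coordinate from the k highest-scoring voxels.

    A softmax restricted to the top-k logits weights their integer
    coordinates, giving a sub-voxel estimate.
    """
    flat = logits.reshape(-1)
    top_idx = np.argpartition(flat, -k)[-k:]        # indices of the k largest
    top_logits = flat[top_idx]
    weights = np.exp(top_logits - top_logits.max())
    weights /= weights.sum()
    coords = np.stack(np.unravel_index(top_idx, logits.shape), axis=-1)
    return weights @ coords.astype(np.float64)      # shape (3,)

def voxel_to_metric(voxel_coord, lo, hi, grid_size):
    """Hypothetical mapping from voxel units to workspace units.

    Assumes voxel i covers [i, i + 1) / grid_size of the range lo..hi.
    """
    lo = np.asarray(lo, dtype=np.float64)
    hi = np.asarray(hi, dtype=np.float64)
    return lo + (voxel_coord + 0.5) / grid_size * (hi - lo)

# Synthetic logits peaked at an off-grid location, standing in for a head output.
grid_size = 16
center = np.array([4.2, 9.3, 2.0])
axes = np.arange(grid_size, dtype=np.float64)
zz, yy, xx = np.meshgrid(axes, axes, axes, indexing="ij")
coords = np.stack([zz, yy, xx], axis=-1)
logits = -np.sum((coords - center) ** 2, axis=-1) / 2.0

decoded = topk_soft_argmax(logits, k=8)
# Illustrative workspace bounds in meters (assumed, not from the source).
metric = voxel_to_metric(decoded, lo=(-0.5, -0.5, 0.0), hi=(0.5, 0.5, 1.0),
                         grid_size=grid_size)
```

Restricting the weighted average to the top-k voxels keeps low-probability mass far from the peak from pulling the estimate away, while still recovering a continuous action rather than a voxel index.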

Topics

Voxel Heatmaps
Robot Manipulation
Vision-Language-Action Models
ActionMap
Data Efficiency
Robotic Control

Code references

showlab/ActionMap

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Robotics Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.