ATOM-Bench: A Real-World Benchmark for Atomic Skills and Compositional Generalization in Manipulation Policies
Summary
ATOM-Bench is a new real-world benchmark designed to diagnose generalization capabilities in generalist robotic manipulation policies by evaluating both atomic skills and compositional generalization. This benchmark factorizes tabletop manipulation into motor atoms and instruction atoms, featuring 30 atomic tasks and 24 held-out compositional tasks across paired single-arm and dual-arm robot tracks. Researchers collected 3,000 human demonstrations for atomic fine-tuning and performed 2,700 physical rollouts on five representative manipulation policies. ATOM-Bench introduces Atomic Score (AS) and Compositional Failure Share (CFS) to differentiate failure causes. Initial evaluations reveal that while current policies can acquire simple instruction-grounding skills, they struggle with fine-grained motor atoms, counting, and logical filtering. Crucially, strong performance on atomic tasks does not consistently translate to success on held-out compositional tasks.
Key takeaway
If you are a robotics engineer developing generalist manipulation policies, evaluate beyond atomic task success. Strong atomic performance does not reliably transfer to held-out compositions, so test compositional generalization directly. Benchmarks like ATOM-Bench help you diagnose whether failures stem from weak motor execution, poor instruction grounding, or limited compositional reuse, which tells you where to focus model improvements.
Key insights
Current robotic manipulation policies struggle with fine-grained motor atoms, counting, and logical filtering, and strong performance on atomic tasks does not consistently carry over to held-out compositional tasks.
Principles
- Manipulation can be factorized into motor and instruction atoms.
- Atomic skill proficiency does not guarantee compositional generalization.
- Failures can be attributed to weak motor execution, poor instruction grounding, or limited compositional reuse.
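The factorization idea can be sketched in code. This is a minimal illustration, not ATOM-Bench's actual task schema: the `Task` class, the `is_held_out_composition` helper, and all atom names (`grasp`, `pour`, `count`, and so on) are hypothetical. It assumes a compositional task counts as held out when each of its atoms appears in some training task but the exact combination does not.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    # Hypothetical schema: a task is described by the atoms it exercises.
    name: str
    motor_atoms: frozenset
    instruction_atoms: frozenset


def is_held_out_composition(task, atomic_tasks):
    """True if every atom in `task` was seen in training but the combination was not."""
    seen_motor = frozenset().union(*(t.motor_atoms for t in atomic_tasks))
    seen_instr = frozenset().union(*(t.instruction_atoms for t in atomic_tasks))
    atoms_seen = (task.motor_atoms <= seen_motor
                  and task.instruction_atoms <= seen_instr)
    combo_seen = any(
        t.motor_atoms == task.motor_atoms
        and t.instruction_atoms == task.instruction_atoms
        for t in atomic_tasks
    )
    return atoms_seen and not combo_seen


# Hypothetical atom names, for illustration only.
atomic = [
    Task("grasp", frozenset({"grasp"}), frozenset()),
    Task("pour", frozenset({"pour"}), frozenset()),
    Task("count", frozenset({"grasp"}), frozenset({"count"})),
]
composed = Task("pour_n_times", frozenset({"pour"}), frozenset({"count"}))
unseen = Task("fold_by_color", frozenset({"fold"}), frozenset({"color"}))

print(is_held_out_composition(composed, atomic))  # True
print(is_held_out_composition(unseen, atomic))    # False
```

The second call returns False because the task relies on atoms that were never trained, so a failure there would not say anything about compositional reuse.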
Method
Policies are fine-tuned on the 30 atomic tasks using human demonstrations, then tested on 24 held-out compositional tasks across paired single-arm and dual-arm tracks. Atomic Score (AS) and Compositional Failure Share (CFS) are used to separate failure causes.
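The evaluation scale and the idea of splitting failures by cause can be illustrated with a small sketch. It does not reproduce the paper's AS or CFS formulas, which this summary does not define. The `failure_shares` function and the outcome labels are hypothetical, and the per-pair rollout count assumes the 2,700 rollouts were spread evenly over all five policies and all 54 tasks, which the summary does not state.

```python
from collections import Counter

# Figures taken from the summary.
ATOMIC_TASKS = 30
COMPOSITIONAL_TASKS = 24
POLICIES = 5
ROLLOUTS = 2700

# Assumption: rollouts are split evenly across every (policy, task) pair.
rollouts_per_pair = ROLLOUTS / (POLICIES * (ATOMIC_TASKS + COMPOSITIONAL_TASKS))


def failure_shares(outcomes):
    """Return (success_rate, {cause: fraction of failures}) for labeled rollouts.

    Illustrative tally only; not the benchmark's AS or CFS definition.
    """
    if not outcomes:
        raise ValueError("need at least one rollout outcome")
    counts = Counter(outcomes)
    successes = counts["success"]
    failures = len(outcomes) - successes
    shares = {
        cause: n / failures
        for cause, n in counts.items()
        if cause != "success"
    }
    return successes / len(outcomes), shares


# Hypothetical labels for ten rollouts of one policy on one task.
labels = ["success"] * 4 + ["motor"] * 3 + ["instruction"] * 2 + ["composition"]
rate, shares = failure_shares(labels)
print(rollouts_per_pair)  # 10.0
print(rate)               # 0.4
print(shares["motor"])    # 0.5
```

Under the even-split assumption, each policy gets ten rollouts per task, so a single rollout moves a per-task success rate by ten percentage points.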
In practice
- Benchmark generalist manipulation policies with ATOM-Bench.
- Diagnose policy failures in motor execution or instruction grounding.
- Fine-tune atomic skills using human demonstration data.
Topics
- Robotic Manipulation
- Generalization Benchmarks
- Atomic Skills
- Compositional Generalization
- Robot Learning
- Policy Evaluation
Best for: Research Scientist, AI Scientist, Robotics Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.