Let It Be Simple: One-Step Action Generation for Vision-Language-Action Models
Summary
The paper "Let It Be Simple: One-Step Action Generation for Vision-Language-Action Models" proposes a simplified approach to action generation in diffusion-based vision-language-action (VLA) models. It argues that the condition-target structure of VLA action generation differs from that of image synthesis, and that this difference makes one-step action prediction effective. The method does not borrow the techniques developed for one-step image synthesis, such as teacher models or auxiliary objectives. It only biases the training time distribution toward high-noise states. Experiments on a controlled MNIST task and on robot-policy benchmarks (LIBERO, LIBERO-Plus, and LIBERO-Pro) show that one-step policies trained with this high-noise-biased schedule generally match ten-step decoding performance. On standard LIBERO, these policies even surpass ten-step policies trained with a uniform time distribution. A real-robot bimanual YAM RSS evaluation further supports this trend. Notably, a 1.4B VLM with a 30M action head achieved 95.6% on LIBERO-Long using one-step decoding.
Key takeaway
If you are a Machine Learning Engineer optimizing vision-language-action (VLA) model inference, reconsider whether multi-step diffusion is necessary. Biasing your diffusion training schedule toward high-noise states yields one-step action generation that generally matches ten-step decoding and, on standard LIBERO, exceeds ten-step policies trained with a uniform time distribution. One-step decoding replaces ten denoising passes of the action head with a single pass and needs no distillation or auxiliary objectives, which makes VLA deployments more efficient for real-time robotic control.
Key insights
VLA action generation's unique structure allows effective one-step prediction by biasing diffusion training towards high-noise states.
Principles
- VLA action generation has an asymmetric condition-target structure.
- High-noise biased training enables strong one-step VLA policies.
Method
Bias the training time distribution of diffusion models toward high-noise states, using standard velocity prediction, without complex distillation or auxiliary objectives.
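The training-side change described above can be sketched in a few lines of standard-library Python. This is an illustrative sketch under stated assumptions, not the paper's implementation: the summary does not name the time distribution, so the Beta(bias, 1) choice, the convention that t = 1 is pure noise and t = 0 is the clean action, and the helper names `sample_time` and `training_pair` are all hypothetical.

```python
import random

def sample_time(rng, high_noise_bias=4.0):
    """Draw a training time t in [0, 1].

    Assumption: t = 1 means pure noise. Beta(bias, 1) with bias > 1 puts
    most of its mass near t = 1; bias = 1 recovers the uniform schedule.
    """
    return rng.betavariate(high_noise_bias, 1.0)

def training_pair(rng, action, t):
    """Build the noisy input x_t and the velocity target for one action vector.

    Assumed interpolation: x_t = (1 - t) * action + t * noise,
    so the velocity target is d x_t / d t = noise - action.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in action]
    x_t = [(1.0 - t) * a + t * e for a, e in zip(action, noise)]
    v_target = [e - a for a, e in zip(action, noise)]
    return x_t, v_target

if __name__ == "__main__":
    rng = random.Random(0)
    biased = [sample_time(rng) for _ in range(10_000)]
    uniform = [sample_time(rng, high_noise_bias=1.0) for _ in range(10_000)]
    # The biased mean should sit well above the uniform mean of about 0.5.
    print(sum(biased) / len(biased), sum(uniform) / len(uniform))
```

A model trained on such pairs would regress its predicted velocity onto `v_target`; the only difference from a uniform schedule in this sketch is the argument passed to `sample_time`.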
In practice
- Train with the high-noise-biased schedule and decode in one step; the resulting policies generally match ten-step decoding performance.
- Reported result: 95.6% on LIBERO-Long with a 1.4B VLM and a 30M action head, using one-step decoding.
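To make the difference between one-step and ten-step decoding concrete, here is a minimal sketch of Euler integration over a velocity field. It assumes the convention that decoding runs from t = 1 (noise) to t = 0 (action). The `toy_velocity` function is a hypothetical stand-in for a trained action head: it returns the exact velocity for a toy case in which the condition pins down a single target action, and it is not the paper's model.

```python
def decode(velocity_fn, noise, num_steps):
    """Euler-integrate from t = 1 (noise) to t = 0 (action) in num_steps steps."""
    x = list(noise)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = 1.0 - i * dt          # current time; never reaches 0 inside the loop
        v = velocity_fn(x, t)     # one forward pass of the action head per step
        x = [xi - dt * vi for xi, vi in zip(x, v)]
    return x

def toy_velocity(target):
    """Hypothetical stand-in for a trained head when one action is correct.

    With x_t = (1 - t) * target + t * noise, the velocity noise - target
    can be rewritten as (x_t - target) / t.
    """
    def fn(x, t):
        return [(xi - ai) / t for xi, ai in zip(x, target)]
    return fn

if __name__ == "__main__":
    target = [0.2, -0.7, 1.1]
    noise = [0.9, 0.1, -1.3]
    head = toy_velocity(target)
    print(decode(head, noise, num_steps=1))   # one forward pass
    print(decode(head, noise, num_steps=10))  # ten forward passes
```

In this toy setting both calls land on the target action up to floating-point error, so the nine extra forward passes buy nothing. Whether a real policy behaves this way is an empirical question, which is what the benchmark results above address.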
Topics
- Vision-Language-Action Models
- Diffusion Models
- One-Step Action Generation
- Robot Policy
- LIBERO Benchmark
- Inference Efficiency
Best for: Research Scientist, AI Scientist, Robotics Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.