Let It Be Simple: One-Step Action Generation for Vision-Language-Action Models
Summary
The paper "Let It Be Simple: One-Step Action Generation for Vision-Language-Action Models" challenges the conventional iterative denoising approach for diffusion-based vision-language-action (VLA) models, arguing that the condition-target structure of VLA action generation differs from that of image synthesis. Unlike image generators, VLA policies are conditioned on rich observations, language, and state, but predict only compact, low-dimensional actions. The authors propose a simplified method that biases the distribution of diffusion times sampled during training toward high-noise states, using standard velocity prediction without additional teacher models, distillation, or auxiliary objectives. The approach was validated across robot-policy experiments, including the LIBERO, LIBERO-Plus, and LIBERO-Pro benchmarks. Results show that one-step policies trained with high-noise-biased schedules generally match ten-step decoding and, on standard LIBERO, can even surpass ten-step policies trained with a uniform time distribution. A 1.4B-parameter VLM with a 30M-parameter action head achieved 95.6% on LIBERO-Long with one-step decoding, demonstrating that effective one-step VLA action generation can emerge from standard diffusion training.
Key takeaway
Machine learning engineers optimizing Vision-Language-Action models can achieve robust one-step action generation without iterative denoising. Consider training diffusion-based VLA policies with a high-noise-biased time schedule. The approach uses standard velocity prediction, and the resulting one-step policies generally match ten-step decoding and, on standard LIBERO, can surpass ten-step policies trained with a uniform time distribution. Decoding in one step instead of ten cuts the number of denoising passes per action, which lowers inference latency and compute for real-time robotic applications.
Key insights
Vision-Language-Action models can achieve strong one-step action generation by biasing diffusion training towards high-noise states.
Principles
- VLA action generation differs from image synthesis.
- Compact action prediction simplifies diffusion needs.
- Biasing noise distribution improves one-step policies.
Method
Bias diffusion training schedules toward high-noise states using standard velocity prediction, avoiding teacher models, distillation, or auxiliary objectives.
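As a rough illustration of what a high-noise-biased schedule could look like, the sketch below draws diffusion times from a Beta(alpha, 1) distribution and builds velocity-prediction training pairs. Everything specific here is an assumption for illustration, not taken from the paper: the Beta form and `alpha=3.0`, the convention that t = 1 is pure noise with a linear interpolation path, and the function names `sample_high_noise_times` and `make_training_pair`.

```python
import numpy as np


def sample_high_noise_times(rng, batch_size, alpha=3.0):
    """Draw diffusion times in [0, 1] biased toward t = 1.

    Assumption: t = 1 means pure noise. Beta(alpha, 1) with alpha > 1
    puts more probability mass near 1 than a uniform distribution does
    (alpha = 1 recovers the uniform schedule).
    """
    return rng.beta(alpha, 1.0, size=batch_size)


def make_training_pair(rng, actions, alpha=3.0):
    """Build (t, noisy_actions, velocity_target) for one training batch.

    Assumed path: x_t = (1 - t) * action + t * noise, so the velocity
    target dx_t/dt is (noise - action).
    """
    t = sample_high_noise_times(rng, actions.shape[0], alpha)
    noise = rng.standard_normal(actions.shape)
    # Reshape t so it broadcasts over the trailing action dimensions.
    t_b = t.reshape(-1, *([1] * (actions.ndim - 1)))
    noisy_actions = (1.0 - t_b) * actions + t_b * noise
    velocity_target = noise - actions
    return t, noisy_actions, velocity_target
```

A policy network would then be regressed onto `velocity_target` given `noisy_actions`, `t`, and the observation and language conditioning, with no teacher model or auxiliary loss. Setting `alpha=1.0` gives a uniform baseline for comparison.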
In practice
- Implement high-noise biased training schedules.
- Evaluate one-step policies on robot control tasks.
- Test simplified diffusion for VLA models.
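To compare one-step and ten-step decoding on a task, a simple Euler sampler with a configurable step count is enough. The sketch below is a minimal version under stated assumptions: t = 1 is pure noise, t = 0 is the clean action, and `velocity_fn` stands in for a trained policy head (observation and language conditioning omitted). The names `decode_actions` and `point_mass_velocity` are hypothetical, and the toy oracle only serves to make the example runnable.

```python
import numpy as np


def decode_actions(velocity_fn, noise, num_steps=1):
    """Integrate the velocity field from t = 1 (noise) down to t = 0 with Euler steps.

    num_steps=1 is one-step decoding: a single network call at t = 1.
    """
    x = noise.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = 1.0 - k * dt
        x = x - dt * velocity_fn(x, t)
    return x


def point_mass_velocity(x, t, target):
    """Toy oracle: exact velocity when every demonstration is the same action.

    With x_t = (1 - t) * target + t * noise, the velocity is (x_t - target) / t.
    """
    return (x - target) / t


rng = np.random.default_rng(0)
target = np.array([0.1, -0.2, 0.3])  # hypothetical 3-D action
noise = rng.standard_normal((5, 3))

one_step = decode_actions(lambda x, t: point_mass_velocity(x, t, target), noise, num_steps=1)
ten_step = decode_actions(lambda x, t: point_mass_velocity(x, t, target), noise, num_steps=10)
```

In this toy case both decoders recover `target`, because the velocity field is constant along each path. With a learned policy the two would generally differ, which is the comparison that task success rates are meant to measure.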
Topics
- Vision-Language-Action Models
- Diffusion Models
- Robot Policy Learning
- One-Step Action Generation
- Machine Learning
- Robotics
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Robotics Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.