Let It Be Simple: One-Step Action Generation for Vision-Language-Action Models

2026-06-04 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Computer Vision & Pattern Recognition · Depth: Expert, quick

Summary

The paper "Let It Be Simple: One-Step Action Generation for Vision-Language-Action Models" challenges the assumption that diffusion-based vision-language-action (VLA) models need iterative denoising. Its argument is that action generation has a different condition-target structure from image synthesis: a VLA policy is conditioned on rich observations, language, and state, yet predicts only compact, low-dimensional actions. The authors propose a simple change: bias the distribution of diffusion times sampled during training toward high-noise states while keeping standard velocity prediction, with no teacher model, distillation, or auxiliary objective. They validate this across robot-policy experiments, including the LIBERO, LIBERO-Plus, and LIBERO-Pro benchmarks. One-step policies trained with high-noise-biased schedules generally match ten-step decoding, and on standard LIBERO they can surpass ten-step policies trained with a uniform time distribution. A 1.4B VLM with a 30M action head reaches 95.6% on LIBERO-Long with one-step decoding, which suggests that effective one-step VLA action generation can emerge from standard diffusion training.

Key takeaway

Machine learning engineers optimizing vision-language-action models may not need iterative denoising to get strong action generation. Consider training diffusion-based VLA policies with a high-noise-biased time schedule and standard velocity prediction. In the reported experiments, one-step decoding under this schedule generally matches ten-step decoding, and on standard LIBERO it can surpass ten-step policies trained with a uniform time distribution. Replacing ten denoising passes with one at inference matters for real-time robotic applications.

Key insights

Diffusion-based vision-language-action models can achieve strong one-step action generation when training is biased toward high-noise states.

Principles

VLA action generation has a different condition-target structure from image synthesis.
Compact, low-dimensional action targets reduce the need for iterative denoising.
Biasing the training time distribution toward high noise improves one-step policies.

Method

Bias diffusion training schedules toward high-noise states using standard velocity prediction, avoiding teacher models, distillation, or auxiliary objectives.
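
As a rough illustration of the method above, the sketch below builds one training example under a high-noise-biased time distribution with a velocity target. Everything specific here is an assumption for illustration and is not taken from the paper: the Beta(3, 1) sampler, the convention that t = 1 is pure noise and t = 0 is the clean action, the linear interpolation path, and the helper names sample_time, make_training_pair, and velocity_loss.

```python
import random


def sample_time(alpha=3.0, beta=1.0, rng=random):
    """Draw a diffusion time t in [0, 1].

    Assumed convention: t = 1 is pure noise, t = 0 is the clean action.
    With alpha > beta the Beta distribution puts more mass near t = 1,
    which is one possible way to bias training toward high-noise states.
    alpha = beta = 1 recovers a uniform time distribution.
    """
    return rng.betavariate(alpha, beta)


def make_training_pair(action, rng=random, alpha=3.0, beta=1.0):
    """Return (t, noisy_action, velocity_target) for one clean action vector."""
    t = sample_time(alpha, beta, rng)
    noise = [rng.gauss(0.0, 1.0) for _ in action]
    # Linear path between the clean action (t = 0) and Gaussian noise (t = 1).
    noisy = [(1.0 - t) * a + t * n for a, n in zip(action, noise)]
    # Velocity along that path: d(noisy)/dt = noise - action.
    target = [n - a for a, n in zip(action, noise)]
    return t, noisy, target


def velocity_loss(predicted, target):
    """Mean squared error between predicted and target velocity."""
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)
```

In a real policy the prediction would come from the action head given the observation, instruction, state, noisy action, and t. Only the way t is drawn differs from a uniform-schedule baseline, which fits the summary's point that no teacher model or auxiliary objective is involved.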

In practice

Implement high-noise biased training schedules.
Evaluate one-step policies on robot control tasks.
Compare one-step decoding against a ten-step baseline for your VLA policy.
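
To make the one-step versus ten-step comparison concrete, here is a minimal sketch of Euler decoding with a configurable number of steps. It assumes the same convention as a linear noise-to-action path (t = 1 is noise, t = 0 is the action). The function decode and the toy velocity field are hypothetical: the toy field stands in for a trained action head in the idealized case where the conditioning pins down a single target action, and it is not the paper's model.

```python
import random


def decode(velocity_fn, noise, steps=1):
    """Euler-integrate from t = 1 (noise) down to t = 0 in `steps` passes.

    steps=1 is one-step decoding: action = noise - velocity(noise, t=1).
    """
    x = list(noise)
    dt = 1.0 / steps
    for i in range(steps):
        t = 1.0 - i * dt  # current time; never reaches 0 inside the loop
        v = velocity_fn(x, t)
        x = [xi - dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target_action):
    """Exact velocity field when only one action is possible (toy assumption)."""
    def velocity_fn(x, t):
        # On the path x = (1 - t) * a + t * noise, noise - a equals (x - a) / t.
        return [(xi - a) / t for xi, a in zip(x, target_action)]
    return velocity_fn


rng = random.Random(0)
target = [0.3, -0.1, 0.8]  # hypothetical 3-dimensional action
noise = [rng.gauss(0.0, 1.0) for _ in target]
v = make_toy_velocity(target)

one_step = decode(v, noise, steps=1)
ten_step = decode(v, noise, steps=10)
print("one-step:", one_step)
print("ten-step:", ten_step)
```

In this idealized case both settings land on the same action, so extra steps buy nothing. With a learned policy the two can differ, which is why the comparison is worth running on actual robot control tasks.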

Topics

Vision-Language-Action Models
Diffusion Models
Robot Policy Learning
One-Step Action Generation
Machine Learning
Robotics

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Robotics Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.