Let It Be Simple: One-Step Action Generation for Vision-Language-Action Models

2026-06-04 · Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, medium

Summary

The paper "Let It Be Simple: One-Step Action Generation for Vision-Language-Action Models" proposes a simplified approach for diffusion-based vision-language-action (VLA) models. It argues that the condition-target structure of VLA action generation differs from that of image synthesis, which makes effective one-step action prediction possible. The method biases the training time distribution toward high-noise states and decodes in a single step instead of denoising iteratively, without the machinery used for one-step image synthesis such as teacher models or auxiliary objectives. Experiments on a controlled MNIST task and on robot-policy benchmarks, including LIBERO, LIBERO-Plus, and LIBERO-Pro, show that one-step policies trained with this high-noise biased schedule generally match ten-step decoding performance. On standard LIBERO, these policies even surpass ten-step policies trained with a uniform time distribution. A real-robot bimanual YAM RSS evaluation further supports this trend. Notably, a 1.4B-parameter VLM with a 30M-parameter action head achieved 95.6% on LIBERO-Long using one-step decoding.

Key takeaway

Machine learning engineers optimizing vision-language-action (VLA) inference should reconsider whether multi-step diffusion decoding is necessary. Biasing the diffusion training time distribution toward high-noise states yields one-step policies that generally match ten-step decoding and, on standard LIBERO, surpass ten-step policies trained with a uniform time distribution. Because decoding takes one denoising pass instead of ten, and training needs no distillation or auxiliary objectives, VLA deployments become cheaper to run for real-time robotic control.

Key insights

VLA action generation's unique structure allows effective one-step prediction by biasing diffusion training towards high-noise states.

Principles

VLA action generation has an asymmetric condition-target structure.
High-noise biased training enables strong one-step VLA policies.

Method

Bias the training time distribution of diffusion models toward high-noise states, using standard velocity prediction, without complex distillation or auxiliary objectives.
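To make the idea concrete, here is a minimal sketch of what a high-noise biased time distribution could look like next to a uniform one. It is not the paper's implementation. The summary does not say which distribution is used, so the Beta(4, 1) sampler, the function names, the convention that t = 1 means pure noise, and the linear interpolation path are all assumptions made for illustration.

```python
import random


def sample_time_uniform(rng):
    """Baseline: draw the diffusion time t uniformly from [0, 1)."""
    return rng.random()


def sample_time_high_noise(rng, alpha=4.0, beta=1.0):
    """Hypothetical biased sampler: Beta(alpha, beta) with alpha > beta
    concentrates mass near t = 1, which is pure noise in this sketch."""
    return rng.betavariate(alpha, beta)


def make_training_example(action, rng, sample_time):
    """Build one velocity-prediction training example (assumed linear path).

    noisy_action    = (1 - t) * action + t * noise
    velocity_target = noise - action
    """
    t = sample_time(rng)
    noise = [rng.gauss(0.0, 1.0) for _ in action]
    noisy_action = [(1.0 - t) * a + t * e for a, e in zip(action, noise)]
    velocity_target = [e - a for a, e in zip(action, noise)]
    return t, noisy_action, velocity_target


rng = random.Random(0)
num_samples = 10_000
uniform_mean = sum(sample_time_uniform(rng) for _ in range(num_samples)) / num_samples
biased_mean = sum(sample_time_high_noise(rng) for _ in range(num_samples)) / num_samples

# In expectation the uniform sampler averages 0.5 and Beta(4, 1) averages 0.8,
# so the biased sampler spends far more training steps near pure noise.
print(f"mean t, uniform: {uniform_mean:.3f}")
print(f"mean t, biased:  {biased_mean:.3f}")
```

Only the sampler changes between the two schedules; the velocity target and the loss built on it stay the same, which is consistent with the summary's point that no teacher model or auxiliary objective is involved.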

In practice

One-step VLA policies trained with the high-noise biased schedule generally match ten-step decoding performance.
A 1.4B-parameter VLM with a 30M-parameter action head reaches 95.6% success on LIBERO-Long with one-step decoding.
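The difference between one-step and ten-step decoding can be sketched as an Euler integration loop that runs once or ten times. This is a toy under strong assumptions, not the paper's code: oracle_velocity is a hypothetical stand-in for a trained action head, and it assumes the conditioning pins down a single target action, a simplification chosen here so that the velocity field has a closed form. The time convention (t = 1 is pure noise) is also an assumption.

```python
import random


def oracle_velocity(x, t, target_action):
    """Hypothetical stand-in for a trained action head.

    If the condition fixes one target action a, the velocity along the linear
    path x_t = (1 - t) * a + t * noise is (x_t - a) / t.
    """
    return [(xi - ai) / t for xi, ai in zip(x, target_action)]


def decode(target_action, num_steps, rng):
    """Euler integration from pure noise (t = 1) down to t = 0.

    Each loop iteration corresponds to one forward pass of the action head.
    """
    x = [rng.gauss(0.0, 1.0) for _ in target_action]
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = 1.0 - step * dt  # never reaches 0 inside the loop
        v = oracle_velocity(x, t, target_action)
        x = [xi - dt * vi for xi, vi in zip(x, v)]
    return x


target = [0.3, -0.5, 0.1]
one_step = decode(target, num_steps=1, rng=random.Random(0))
ten_step = decode(target, num_steps=10, rng=random.Random(0))

print("one-step:", one_step)
print("ten-step:", ten_step)
```

In this idealized setting both decoders land on the same action, while the one-step version calls the action head once instead of ten times. A learned model only approximates such a velocity field, so how close one-step decoding gets in practice is an empirical question of the kind the reported benchmarks address.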

Topics

Vision-Language-Action Models
Diffusion Models
One-Step Action Generation
Robot Policy
LIBERO Benchmark
Inference Efficiency

Code references

EmbodiedAI-RoboTron/CF-VLA

Best for: Research Scientist, AI Scientist, Robotics Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.