TempoVLA: Learning Speed-Controllable Vision-Language-Action Policies

2026-06-06 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, medium

Summary

TempoVLA is a Vision-Language-Action (VLA) model that addresses a limitation of existing VLAs: they execute at a single fixed speed, even though different phases of a robot manipulation task call for different speeds. Existing acceleration methods only shift a policy to another fixed speed and largely ignore deceleration. TempoVLA builds on the observation that the magnitude of the predicted action directly controls how fast the robot moves. It combines two components: Variable-Speed Trajectory Augmentation (VSTA), which re-times demonstration data to any target speed by merging or splitting actions while preserving motion semantics, and a model-side conditioning mechanism that explicitly feeds the desired speed to the policy. Experiments show that VSTA achieves the requested speeds with minimal motion error, and that TempoVLA provides flexible bidirectional speed control in both simulated and real-world tasks. VSTA also improves default 1x performance through better data utilization, and TempoVLA enables dynamic speed scheduling when paired with a large multimodal model.

Key takeaway

Robotics engineers deploying Vision-Language-Action models for real-world manipulation should consider adding explicit speed control to improve task performance and safety. TempoVLA's approach lets a robot accelerate through low-risk transit phases and decelerate for high-risk contact stages, which improves precision. To get this capability, augment your VLA training data with variable-speed trajectories and condition the model on a scalar speed input.

Key insights

Robot execution speed can be bidirectionally controlled by explicitly scaling predicted action magnitudes within Vision-Language-Action models.

Principles

Dynamic speed control is crucial for robot manipulation.
Action magnitude directly governs robot movement speed.
Variable-speed data augmentation boosts VLA performance.
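
The second principle can be made concrete under a common assumption that the summary does not spell out: each action a_t is an end-effector displacement executed over a fixed control period Δt. The symbols v_t (speed at step t) and s (speed multiplier) are introduced here for illustration only.

```latex
v_t = \frac{\lVert a_t \rVert}{\Delta t},
\qquad
a_t' = s\,a_t \;\Longrightarrow\; v_t' = \frac{\lVert s\,a_t \rVert}{\Delta t} = s\,v_t
\quad (s > 0)
```

Under that assumption, s > 1 accelerates the motion and 0 < s < 1 decelerates it, while the direction of each step is unchanged.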

Method

TempoVLA re-times demonstrations with VSTA, merging actions to speed a trajectory up and splitting them to slow it down, and feeds a scalar speed condition to the policy so that the magnitude of its predicted actions follows the requested speed.
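
The re-timing idea can be illustrated with a minimal sketch. It is not the paper's VSTA implementation: it assumes actions are per-step positional deltas, resamples the cumulative path by linear interpolation, and ignores rotations, gripper commands, and observations, all of which a real pipeline would need to handle. The function name `retime_deltas` is hypothetical.

```python
from typing import List, Sequence


def retime_deltas(deltas: Sequence[Sequence[float]], speed: float) -> List[List[float]]:
    """Re-time per-step delta actions to `speed` times the original pace.

    Assumption (not from the source): each action is a positional delta, so
    summing deltas merges steps and interpolating between waypoints splits them.
    """
    if speed <= 0:
        raise ValueError("speed must be positive")
    if not deltas:
        return []

    dim = len(deltas[0])
    n = len(deltas)

    # positions[i] is the cumulative displacement after i original steps.
    positions = [[0.0] * dim]
    for d in deltas:
        positions.append([p + x for p, x in zip(positions[-1], d)])

    def sample(t: float) -> List[float]:
        """Linearly interpolate the path at original-step time t in [0, n]."""
        t = min(max(t, 0.0), float(n))
        i = min(int(t), n - 1)
        frac = t - i
        return [a + frac * (b - a) for a, b in zip(positions[i], positions[i + 1])]

    out: List[List[float]] = []
    prev = positions[0]
    k = 1
    while True:
        # Each new step covers `speed` original steps; the last one may be shorter.
        t = min(k * speed, float(n))
        cur = sample(t)
        out.append([c - p for c, p in zip(cur, prev)])
        prev = cur
        if t >= n:
            break
        k += 1
    return out


demo = [[1.0, 0.0], [1.0, 0.0], [0.0, 2.0], [0.0, 2.0]]
fast = retime_deltas(demo, 2.0)   # pairs of steps merged into larger actions
slow = retime_deltas(demo, 0.5)   # each step split into two smaller actions
```

For this four-step demo, `fast` has two steps and `slow` has eight, and both sum to the same total displacement as the original, which is the sense in which the motion is preserved in this sketch.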

In practice

Augment VLA training data with variable-speed trajectories.
Implement explicit speed conditioning in VLA policies.
Pair the policy with a large multimodal model to schedule execution speed dynamically.
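
The last two items can be sketched together as follows. Everything here is an illustrative assumption rather than TempoVLA's actual interface: the phase labels, the speed values in `PHASE_SPEED`, the clamping range, and the choice to condition by appending the scalar to a feature vector are all hypothetical, and `classify_phase` stands in for whatever multimodal model labels the current task phase.

```python
from typing import Callable, Dict, List

# Hypothetical phase-to-speed table; the values are illustrative, not from the paper.
PHASE_SPEED: Dict[str, float] = {"transit": 2.0, "approach": 1.0, "contact": 0.5}


def schedule_speed(phase: str, lo: float = 0.5, hi: float = 2.0) -> float:
    """Map a phase label to a speed multiplier; unknown phases fall back to 1x."""
    speed = PHASE_SPEED.get(phase, 1.0)
    return max(lo, min(hi, speed))  # clamp to an assumed supported range


def condition_observation(features: List[float], speed: float) -> List[float]:
    """Append the scalar speed to the policy input (one simple way to condition)."""
    return list(features) + [speed]


def build_policy_input(
    features: List[float],
    classify_phase: Callable[[List[float]], str],
) -> List[float]:
    """Ask the phase classifier for a label, then attach the scheduled speed."""
    phase = classify_phase(features)
    return condition_observation(features, schedule_speed(phase))


# Stub classifier standing in for a large multimodal model.
policy_input = build_policy_input([0.1, 0.2], lambda feats: "transit")
```

Keeping the scheduler separate from the policy means the speed table can be tuned, or replaced by a learned scheduler, without retraining the policy, provided the requested speeds stay within the range seen during augmentation.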

Topics

Vision-Language-Action Models
Robot Manipulation
Speed Control
Data Augmentation
Dynamic Speed Scheduling
Multimodal AI

Best for: Research Scientist, AI Scientist, Robotics Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.