Real-Time Execution with Autoregressive Policies

2026-06-12 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Software Development & Engineering · Depth: Expert, extended

Summary

A new approach, REALFAST, enables real-time execution for autoregressive Vision-Language-Action (VLA) models, whose rollouts are inherently slower than those of diffusion policies. The method adapts existing autoregressive policies in three ways: it sets the tokenization horizon to H=2m, tokenizes over m-step horizons to support prefix conditioning, and uses constrained decoding to guarantee a strict latency bound (d_m <= m). It also employs multi-trajectory decoding, which puts otherwise idle computation time to work to improve performance. In experiments in simulated (LIBERO) and real-world (DROID) environments on an RTX 4090 GPU, REALFAST consistently outperforms equivalent flow-matching policies such as pi_0 with real-time chunking (RTC). Applied to pi_0-FAST, it achieves task success rates and rollout speeds comparable to pi_0.5, with reconstruction MSEs of approximately 3e-4 (LIBERO) and 5e-4 (DROID) for 400 ms action chunks. The approach reaches peak performance within 6k training steps and supports multi-trajectory decoding with N = 1, 2, or 4 trajectories.

Key takeaway

If you are a machine learning engineer building real-time robotic control with Vision-Language-Action models, consider autoregressive policies a competitive option. This work shows that constrained decoding combined with multi-trajectory decoding can deliver robust real-time performance, even outperforming some flow-matching counterparts. Prioritize policy robustness over aggressively shortening action chunk horizons, because a more robust policy can yield better task success rates in dynamic environments. A practical starting point is to fine-tune an existing autoregressive model such as pi_0-FAST with these techniques.

Key insights

Autoregressive policies can achieve real-time VLA execution through tokenization horizon adjustment and constrained multi-trajectory decoding.

Principles

Real-time execution requires a continuous action stream, not just low latency.
Policy robustness can outweigh minimal latency for near-future actions.
Multi-trajectory decoding improves performance by utilizing idle compute.
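
The first principle can be illustrated with a small timing model. The sketch below is an illustration under stated assumptions, not code from the paper: it assumes each inference call is launched when the previous chunk starts executing, that it delivers m fresh actions after a delay measured in control steps, and that the robot has nothing to execute whenever that delay exceeds m. The function name count_stalled_steps is hypothetical.

```python
from typing import Iterable


def count_stalled_steps(m: int, delays: Iterable[int]) -> int:
    """Count control steps on which no action is available.

    m      -- number of actions delivered by each inference call
    delays -- inference delay of each successive call, in control steps

    Assumed timing model (not taken from the paper): a call starts when the
    previous chunk begins executing, so the robot has exactly m buffered
    actions to cover the wait. Any delay beyond m leaves a gap.
    """
    if m <= 0:
        raise ValueError("m must be positive")
    stalled = 0
    for delay in delays:
        if delay < 0:
            raise ValueError("delays must be non-negative")
        # The stream stays continuous only when delay <= m.
        stalled += max(0, delay - m)
    return stalled


if __name__ == "__main__":
    print(count_stalled_steps(4, [3, 4, 4]))  # every call meets the bound
    print(count_stalled_steps(4, [1, 1, 7]))  # low average, one slow call
```

With m = 4, the delays [1, 1, 7] average only 3 steps, yet the stream still stalls for 3 steps on the slow call. Under this model, a per-call bound of the form d_m <= m matters more than a low average latency.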

Method

Tailor an autoregressive policy in four steps: set the action horizon to H=2m, tokenize m-length chunks, apply constrained decoding so that the latency bound d_m <= m holds, and use multi-trajectory decoding.
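
The chunk layout in this method can be sketched as follows. This is a minimal illustration, assuming that an H=2m chunk is split into two m-step segments that are tokenized separately, so the tokens of the first segment can act as a conditioning prefix for the second. The uniform-binning tokenizer is a deliberately simple stand-in and is not the FAST tokenizer; split_chunk, tokenize, and detokenize are hypothetical names.

```python
from typing import List, Sequence, Tuple

Action = List[float]


def split_chunk(chunk: Sequence[Action], m: int) -> Tuple[Sequence[Action], Sequence[Action]]:
    """Split an H = 2m chunk into a prefix segment and a future segment."""
    if len(chunk) != 2 * m:
        raise ValueError("expected a chunk of H = 2m actions")
    return chunk[:m], chunk[m:]


def tokenize(segment: Sequence[Action], bins: int = 256,
             low: float = -1.0, high: float = 1.0) -> List[int]:
    """Toy tokenizer: clip each value to [low, high] and map it to a bin id."""
    ids = []
    for action in segment:
        for value in action:
            clipped = min(max(value, low), high)
            ids.append(round((clipped - low) / (high - low) * (bins - 1)))
    return ids


def detokenize(ids: Sequence[int], action_dim: int, bins: int = 256,
               low: float = -1.0, high: float = 1.0) -> List[Action]:
    """Invert the toy tokenizer, regrouping values into actions."""
    values = [low + i / (bins - 1) * (high - low) for i in ids]
    return [values[k:k + action_dim] for k in range(0, len(values), action_dim)]


# Example with m = 2 and 3-dimensional actions (made-up values).
chunk = [[0.0, 0.5, -0.5], [0.1, 0.4, -0.4], [0.2, 0.3, -0.3], [0.3, 0.2, -0.2]]
prefix, future = split_chunk(chunk, m=2)
prefix_tokens = tokenize(prefix)   # would condition the policy
future_tokens = tokenize(future)   # would be the decoding target
```

Because each segment is tokenized on its own, the prefix tokens can be computed from actions that are already committed, without waiting for the rest of the chunk.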

In practice

Use constrained decoding to ensure that the decoded tokens detokenize into a valid action chunk.
Implement multi-trajectory decoding with KV cache sharing for efficiency.
Prioritize policy robustness over aggressively reducing m for faster updates.
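
The first two practices above can be sketched together. The code below is a toy illustration under several assumptions, not the paper's implementation: constrained decoding is modeled as restricting each step to an allowed token set plus a hard token cap, trajectories are sampled one after another rather than batched over a shared KV cache, and the best trajectory is chosen by cumulative log-probability. All names (constrained_decode, best_of_n, toy_logits) and the 8-token vocabulary are hypothetical.

```python
import math
import random
from typing import Callable, Iterable, List, Optional, Tuple

LogitsFn = Callable[[List[int]], List[float]]


def constrained_decode(logits_fn: LogitsFn, allowed_ids: Iterable[int], eos_id: int,
                       max_tokens: int,
                       rng: Optional[random.Random] = None) -> Tuple[List[int], float]:
    """Decode at most max_tokens tokens, choosing only among allowed ids or EOS.

    Greedy when rng is None, otherwise samples from the renormalized
    distribution over the allowed candidates.
    """
    candidates = sorted(set(allowed_ids) | {eos_id})
    tokens: List[int] = []
    logprob = 0.0
    while len(tokens) < max_tokens:  # hard cap bounds the decoding time
        logits = logits_fn(tokens)
        weights = [math.exp(logits[i]) for i in candidates]  # masked softmax
        total = sum(weights)
        if rng is None:
            pick = max(range(len(candidates)), key=lambda k: weights[k])
        else:
            pick = rng.choices(range(len(candidates)), weights=weights)[0]
        logprob += math.log(weights[pick] / total)
        if candidates[pick] == eos_id:
            break
        tokens.append(candidates[pick])
    return tokens, logprob


def best_of_n(logits_fn: LogitsFn, allowed_ids: Iterable[int], eos_id: int,
              max_tokens: int, n: int, seed: int = 0) -> Tuple[List[int], float]:
    """Sample n constrained trajectories and keep the most likely one."""
    rng = random.Random(seed)
    allowed = list(allowed_ids)
    trajectories = [constrained_decode(logits_fn, allowed, eos_id, max_tokens, rng)
                    for _ in range(n)]
    return max(trajectories, key=lambda t: t[1])


def toy_logits(prefix: List[int]) -> List[float]:
    """Stand-in for a policy: token 7 is invalid but has the highest logit."""
    logits = [0.0] * 8
    logits[7] = 5.0
    logits[2 + len(prefix) % 4] = 2.0
    logits[1] = 3.0 if len(prefix) >= 3 else -3.0  # id 1 plays the role of EOS
    return logits


if __name__ == "__main__":
    print(constrained_decode(toy_logits, {2, 3, 4, 5}, eos_id=1, max_tokens=8))
    print(best_of_n(toy_logits, {2, 3, 4, 5}, eos_id=1, max_tokens=8, n=4))
```

An unconstrained greedy decoder would pick the invalid token 7 at every step in this toy setup; the mask removes it from consideration, and the token cap keeps the worst-case decoding length fixed. In a real system the n trajectories would typically be decoded as one batch that reuses the KV cache of the shared prefix.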

Topics

Autoregressive Policies
Vision-Language-Action Models
Real-time Execution
Constrained Decoding
Multi-trajectory Decoding
Robot Manipulation

Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, Robotics Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.