Real-Time Execution with Autoregressive Policies
Summary
REALFAST is a new approach that enables real-time execution for autoregressive Vision-Language-Action (VLA) models, which roll out more slowly than diffusion policies. It adapts an existing autoregressive policy in three ways: setting the tokenization horizon to H=2m, tokenizing over m-step horizons to support prefix conditioning, and applying constrained decoding to guarantee a strict latency bound (d_m <= m). It also uses multi-trajectory decoding to improve performance by putting otherwise idle computation time to work. In experiments in simulation (LIBERO) and in the real world (DROID), run on an RTX 4090 GPU, REALFAST consistently outperforms equivalent flow-matching policies such as pi_0 with real-time action chunking (RTC). Applied to pi_0-FAST, it achieves task success rates and rollout speeds comparable to pi_0.5, with reconstruction MSEs of approximately 3e-4 (LIBERO) and 5e-4 (DROID) for 400 ms action chunks. The approach reaches optimal performance within 6k training steps and supports multi-trajectory decoding with N = 1, 2, or 4.
Key takeaway
If you are a machine learning engineer building real-time robotic control with Vision-Language-Action models, consider autoregressive policies as a competitive option. This work shows that, with constrained decoding and multi-trajectory decoding, they can achieve robust real-time performance and even outperform some flow-matching counterparts. Prioritize policy robustness over aggressively shortening the action chunk horizon, since a more robust policy can yield better task success rates in dynamic environments. A practical starting point is to fine-tune an existing autoregressive model such as pi_0-FAST with these techniques.
Key insights
Autoregressive policies can achieve real-time VLA execution through tokenization horizon adjustment and constrained multi-trajectory decoding.
Principles
- Real-time execution requires a continuous action stream, not just low latency.
- Policy robustness can outweigh minimal latency for near-future actions.
- Multi-trajectory decoding improves performance by utilizing idle compute.
Method
Adapt an autoregressive policy by setting the action horizon to H=2m, tokenizing m-length chunks, applying constrained decoding to enforce the latency bound d_m <= m, and using multi-trajectory decoding.
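The latency bound can be made concrete with a little arithmetic. The sketch below is an illustration, not the paper's code. It assumes that d_m counts the control steps that elapse while one inference call runs, and the helper names (`delay_steps`, `meets_latency_bound`, `split_horizon`), the 20 Hz control rate, and the latency figures are all hypothetical. It checks d_m <= m and splits an H=2m horizon into two m-step halves, which is one reading of how the first half could serve as the conditioning prefix.

```python
from typing import List, Sequence, Tuple


def delay_steps(latency_ms: int, control_hz: int) -> int:
    """Control steps that elapse during one inference call (integer ceiling)."""
    return -((-latency_ms * control_hz) // 1000)


def meets_latency_bound(latency_ms: int, control_hz: int, m: int) -> bool:
    """True if the delay d_m (in control steps) satisfies d_m <= m."""
    return delay_steps(latency_ms, control_hz) <= m


def split_horizon(actions: Sequence, m: int) -> Tuple[List, List]:
    """Split an H = 2m horizon into (prefix half, new half), each of length m."""
    if len(actions) != 2 * m:
        raise ValueError("expected exactly 2 * m actions")
    return list(actions[:m]), list(actions[m:])


# Assumed setup: 20 Hz control, so m = 8 steps spans 400 ms.
CONTROL_HZ = 20
M = 8

fast_enough = meets_latency_bound(250, CONTROL_HZ, M)  # 5 steps of delay
too_slow = meets_latency_bound(410, CONTROL_HZ, M)     # 9 steps of delay
prefix, new_chunk = split_horizon(list(range(2 * M)), M)
```

Under these assumptions, a 250 ms inference call costs 5 control steps and fits inside an 8-step chunk, while a 410 ms call costs 9 steps and would leave the robot without actions for one step.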
In practice
- Use constrained decoding to ensure that every decoded action chunk detokenizes into a valid action sequence.
- Implement multi-trajectory decoding with KV cache sharing for efficiency.
- Prioritize policy robustness over aggressively reducing m for faster updates.
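The first two practices above can be sketched with a toy model. This is a generic illustration of logit masking and best-of-N selection, not REALFAST's implementation: the `EOS` token id, the toy logits, the validity rule, the hard token budget, and the choice to rank candidates by summed log-probability are all assumptions made for the example. A real system would also share the prompt's KV cache across the N trajectories, which this toy omits.

```python
import math
import random
from typing import Callable, Dict, List, Set, Tuple

EOS = 0  # hypothetical end-of-chunk token id


def sample_constrained(logits: Dict[int, float], allowed: Set[int],
                       rng: random.Random) -> Tuple[int, float]:
    """Sample one token from the allowed set; return (token, log-prob)."""
    tokens = sorted(t for t in logits if t in allowed)
    if not tokens:
        raise ValueError("no valid token available")
    peak = max(logits[t] for t in tokens)
    weights = [math.exp(logits[t] - peak) for t in tokens]
    token = rng.choices(tokens, weights=weights, k=1)[0]
    # Log-probability renormalized over the allowed tokens only.
    return token, logits[token] - peak - math.log(sum(weights))


def decode_trajectory(logits_fn: Callable[[List[int]], Dict[int, float]],
                      allowed_fn: Callable[[List[int]], Set[int]],
                      max_tokens: int,
                      rng: random.Random) -> Tuple[List[int], float]:
    """Decode one candidate with a hard cap on the number of tokens."""
    tokens: List[int] = []
    score = 0.0
    for step in range(max_tokens):
        # On the last permitted step only EOS is allowed, so length is bounded.
        allowed = {EOS} if step == max_tokens - 1 else allowed_fn(tokens)
        token, logp = sample_constrained(logits_fn(tokens), allowed, rng)
        tokens.append(token)
        score += logp
        if token == EOS:
            break
    return tokens, score


def best_of_n(n: int, logits_fn, allowed_fn, max_tokens: int,
              rng: random.Random) -> Tuple[List[int], float]:
    """Decode n candidates and keep the one with the highest summed log-prob."""
    candidates = [decode_trajectory(logits_fn, allowed_fn, max_tokens, rng)
                  for _ in range(n)]
    return max(candidates, key=lambda c: c[1])


def toy_logits(prefix: List[int]) -> Dict[int, float]:
    """Stand-in for a policy's next-token logits (hypothetical values)."""
    return {EOS: 0.0 if len(prefix) >= 2 else -5.0, 1: 1.0, 2: 0.5, 3: 0.2}


def toy_allowed(prefix: List[int]) -> Set[int]:
    """Hypothetical validity rule: token 3 is never valid; EOS needs 2+ tokens."""
    return {1, 2, EOS} if len(prefix) >= 2 else {1, 2}


best_tokens, best_score = best_of_n(4, toy_logits, toy_allowed,
                                    max_tokens=6, rng=random.Random(0))
```

Because the mask forces `EOS` on the final permitted step, every candidate ends within `max_tokens` tokens, which is what makes the decoding time predictable. Increasing N trades spare compute for a better chance of a high-scoring candidate.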
Topics
- Autoregressive Policies
- Vision-Language-Action Models
- Real-time Execution
- Constrained Decoding
- Multi-trajectory Decoding
- Robot Manipulation
Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, Robotics Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.