ActQuant: Sub-4-bit Action-Guided Quantization for Vision-Language-Action Models
Summary
ActQuant is an action-guided, mixed-precision post-training quantization (PTQ) framework for Vision-Language-Action (VLA) models. It reaches sub-4-bit weight quantization, which makes these models deployable on edge platforms. It works in two stages. First, an inter-tensor bit allocator assigns each tensor a bit-width according to its contribution to the predicted actions. Second, an intra-tensor scale optimizer tunes per-block scales using action-aware curvature. Paired with OmniModel.cpp, an agentic pipeline that converts models to a native C/C++ runtime, ActQuant substantially reduces model size. On the LIBERO benchmark, at or below 3 bits per weight (bpw), it retains 95.0% on OpenVLA-OFT and 94.8% on π₀.₅. At 2.5 bpw, OpenVLA-OFT reaches 90.1% while its backbone shrinks from 14.3 GB to 2.7 GB (5.3×). On a physical UR3 arm, the quantized π₀.₅ cuts its memory footprint by 2.5× while retaining its task success rate.
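In a mixed-precision scheme, a figure such as 2.5 bpw is an average over tensors, weighted by their parameter counts. The minimal sketch below works through that arithmetic using a made-up four-tensor allocation. The tensor sizes and bit-widths are hypothetical and are not ActQuant's. It also checks the reported 14.3 GB to 2.7 GB ratio.

```python
def average_bpw(allocation):
    """Parameter-weighted mean bit-width of a mixed-precision allocation.

    allocation: list of (num_params, bits) pairs, one per tensor.
    """
    total_params = sum(n for n, _ in allocation)
    total_bits = sum(n * b for n, b in allocation)
    return total_bits / total_params


# Hypothetical allocation: four equal-sized tensors at 2, 2, 3 and 3 bits.
alloc = [(1_000_000, 2), (1_000_000, 2), (1_000_000, 3), (1_000_000, 3)]
print(average_bpw(alloc))       # 2.5

# Compression ratio reported for the OpenVLA-OFT backbone.
print(round(14.3 / 2.7, 1))     # 5.3
```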
Key takeaway
For robotics engineers deploying Vision-Language-Action models on edge hardware, ActQuant addresses tight memory and latency constraints. It delivers substantial compression, for example shrinking the OpenVLA-OFT backbone from 14.3 GB to 2.7 GB, while maintaining high task success rates. Consider integrating ActQuant and OmniModel.cpp to run sub-4-bit inference directly on your robotic platforms, lowering memory use, latency, and energy consumption.
Key insights
ActQuant enables sub-4-bit quantization for VLA models by action-guided mixed-precision allocation and efficient C/C++ deployment.
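This summary does not explain how an importance score becomes a bit-width. One common pattern is a greedy allocator under an average-bpw budget. The sketch below assumes hypothetical per-tensor scores, where a higher score means more influence on the predicted actions. It also assumes a simple error model in which each extra bit quarters a tensor's quantization error. It illustrates budgeted mixed-precision allocation in general and is not ActQuant's allocator.

```python
def allocate_bits(tensors, budget_bpw, choices=(2, 3, 4)):
    """Greedy mixed-precision bit allocation under an average-bpw budget.

    tensors: dict name -> (num_params, score); higher score = more important
             for the predicted actions (hypothetical input).
    Error model (assumption): each extra bit halves the quantization step,
    so a tensor's expected error scales as score * 4**(-bits).
    Returns dict name -> bits, i.e. one uniform bit-width per tensor.
    """
    choices = sorted(choices)
    total = sum(n for n, _ in tensors.values())
    budget = budget_bpw * total             # total bit budget for all weights
    level = {name: 0 for name in tensors}   # index into choices per tensor
    used = choices[0] * total               # everyone starts at the lowest width

    while True:
        best, best_density = None, 0.0
        for name, (n, score) in tensors.items():
            i = level[name]
            if i + 1 == len(choices):
                continue                    # already at the widest option
            lo, hi = choices[i], choices[i + 1]
            extra = (hi - lo) * n
            if used + extra > budget:
                continue                    # upgrade would exceed the budget
            # Modeled error reduction per extra bit spent on this upgrade.
            density = score * n * (4.0 ** -lo - 4.0 ** -hi) / extra
            if density > best_density:
                best, best_density = name, density
        if best is None:
            break                           # no affordable upgrade remains
        i = level[best]
        used += (choices[i + 1] - choices[i]) * tensors[best][0]
        level[best] += 1
    return {name: choices[i] for name, i in level.items()}


scores = {"q_proj": (100, 0.9), "k_proj": (100, 0.1),
          "up_proj": (100, 0.5), "down_proj": (100, 0.2)}
print(allocate_bits(scores, budget_bpw=2.5))
# {'q_proj': 3, 'k_proj': 2, 'up_proj': 3, 'down_proj': 2}
```

Under this model, the error reduction shrinks with each added bit. As a result, the budget spreads across the important tensors instead of pushing the single most important one to the widest option.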
Principles
- Quantize based on action contribution.
- Optimize scales with action-aware curvature.
- Keep one uniform, hardware-friendly bit-width within each tensor.
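The last two principles can be illustrated together. Each tensor keeps one bit-width, and every block inside it gets its own scale, chosen to minimize a curvature-weighted error. The sketch below uses a diagonal, Fisher-style weight per element as a stand-in for ActQuant's Action-Mixed Fisher, whose exact form this summary does not give. The symmetric signed code range, the 0.5 to 1.0 scale grid, and the block size of 4 are all assumptions.

```python
def quantize_block(w, fisher, bits, grid=20):
    """Symmetric uniform quantization of one block at a fixed bit-width.

    The scale is picked by grid search to minimize the curvature-weighted
    error sum_i F_i * (w_i - s * q_i)**2. Returns (scale, integer codes).
    """
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    amax = max(abs(x) for x in w)
    if amax == 0.0:
        return 1.0, [0] * len(w)            # all-zero block: any scale works
    best = None
    for k in range(grid + 1):
        # Shrink the plain absmax scale by a factor from 0.5 up to 1.0.
        s = amax / qmax * (0.5 + 0.5 * k / grid)
        q = [min(qmax, max(qmin, round(x / s))) for x in w]
        err = sum(f * (x - s * c) ** 2 for x, f, c in zip(w, fisher, q))
        if best is None or err < best[0]:
            best = (err, s, q)
    return best[1], best[2]


def quantize_tensor(weights, fisher, bits, block_size=4):
    """One bit-width for the whole tensor, one scale per block."""
    blocks = []
    for start in range(0, len(weights), block_size):
        w = weights[start:start + block_size]
        f = fisher[start:start + block_size]
        blocks.append(quantize_block(w, f, bits))
    return blocks


def dequantize(blocks):
    """Reconstruct approximate weights from (scale, codes) blocks."""
    return [s * c for s, codes in blocks for c in codes]


w = [0.9, -0.1, 0.05, -0.4, 0.2, 0.3, -0.25, 0.01]
fisher = [1.0, 5.0, 5.0, 1.0, 1.0, 1.0, 1.0, 1.0]   # hypothetical curvature weights
blocks = quantize_tensor(w, fisher, bits=3)
print(dequantize(blocks))
```

Because the grid includes the plain absmax scale (factor 1.0), the weighted error is never worse than naive absmax scaling.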
Method
ActQuant uses the Hilbert-Schmidt Independence Criterion (HSIC) for inter-tensor bit allocation and an Action-Mixed Fisher for intra-tensor scale optimization. OmniModel.cpp converts PyTorch VLA models to native C/C++ code backed by GGML kernels.
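HSIC scores the statistical dependence between two paired sets of samples using kernel Gram matrices. With a characteristic kernel such as the Gaussian, its population value is zero exactly when the two variables are independent. The minimal sketch below implements the standard biased estimator, tr(KHLH)/(n-1)², in pure Python. This summary does not say which features ActQuant pairs with actions, which kernel bandwidths it uses, or how it turns scores into bits. The inputs below (per-sample feature vectors and the matching action vectors, with unit bandwidths) are therefore assumptions.

```python
import math


def rbf_gram(xs, sigma=1.0):
    """Gaussian-kernel Gram matrix for a list of feature vectors."""
    def k(a, b):
        d2 = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
        return math.exp(-d2 / (2 * sigma ** 2))
    return [[k(a, b) for b in xs] for a in xs]


def center(K):
    """Return H K H with H = I - (1/n) 11^T (double centering)."""
    n = len(K)
    row = [sum(r) / n for r in K]
    col = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    grand = sum(row) / n
    return [[K[i][j] - row[i] - col[j] + grand for j in range(n)]
            for i in range(n)]


def hsic(xs, ys, sigma_x=1.0, sigma_y=1.0):
    """Biased empirical HSIC: tr(K H L H) / (n - 1)**2 = tr((HKH) L) / (n - 1)**2."""
    n = len(xs)
    Kc = center(rbf_gram(xs, sigma_x))
    L = rbf_gram(ys, sigma_y)
    return sum(Kc[i][j] * L[j][i] for i in range(n) for j in range(n)) / (n - 1) ** 2


actions = [[0.0], [0.5], [1.0], [1.5], [2.0], [2.5]]
tracks = [[2.0 * a[0]] for a in actions]   # feature that follows the action
flat = [[1.0] for _ in actions]            # feature that ignores the action
print(hsic(tracks, actions), hsic(flat, actions))  # first > 0, second == 0.0
```

A tensor whose outputs vary with the actions scores higher than one whose outputs do not, which is the intuition behind using dependence on actions as an allocation signal.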
In practice
- Deploy VLA models on edge devices.
- Reduce the OpenVLA-OFT backbone's memory footprint by 5.3× (14.3 GB to 2.7 GB at 2.5 bpw).
- Achieve a 1.5× inference speedup on GPU.
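Memory figures for block-quantized weights depend on whether per-block scales are counted. As a reference point, GGML's Q4_0 format stores 32 four-bit codes plus one fp16 scale per block, which works out to 4.5 bits per weight. The sketch below computes that overhead and the resulting weight storage. The 3-bit, 64-weight layout and the 1B-parameter count are hypothetical. This summary does not say whether ActQuant's bpw figures include scale storage.

```python
def effective_bpw(bits, block_size, scale_bits=16):
    """Bits per weight once each block also stores one scale of scale_bits."""
    return bits + scale_bits / block_size


def footprint_gb(num_params, bpw):
    """Weight storage in GB (1e9 bytes) for num_params weights at bpw."""
    return num_params * bpw / 8 / 1e9


# Q4_0-style layout: 4-bit codes, one fp16 scale per 32 weights.
print(effective_bpw(4, 32))                                  # 4.5
# Hypothetical 1B-parameter model: 3-bit codes, one fp16 scale per 64 weights.
print(footprint_gb(1_000_000_000, effective_bpw(3, 64)))     # 0.40625
```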
Topics
- Vision-Language-Action Models
- Post-Training Quantization
- Mixed-Precision Quantization
- Edge AI Deployment
- Robotics Manipulation
- GGML
Best for: AI Engineer, Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, Robotics Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.