ActQuant: Sub-4-bit Action-Guided Quantization for Vision-Language-Action Models

2026-06-08 · Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Emerging Technologies & Innovation · Depth: Expert, extended

Summary

ActQuant is an action-guided, mixed-precision post-training quantization (PTQ) framework for Vision-Language-Action (VLA) models. It targets edge deployment by quantizing weights below 4 bits. The method has two stages: an inter-tensor bit allocator assigns each tensor a bit-width based on its action contribution, and an intra-tensor scale optimizer tunes per-block scales using action-aware curvature. ActQuant is paired with OmniModel.cpp, an agentic conversion pipeline that targets a native C/C++ runtime. On the LIBERO benchmark, ActQuant operates at or below 3 bits per weight (bpw) while retaining 95.0% on OpenVLA-OFT and 94.8% on π₀.₅. At 2.5 bpw it reaches 90.1% on OpenVLA-OFT and compresses the backbone from 14.3 GB to 2.7 GB (5.3×). On a physical UR3 arm, the quantized π₀.₅ uses 2.5× less memory while retaining its success rate.

Key takeaway

For robotics engineers deploying Vision-Language-Action models on edge hardware, ActQuant addresses memory and latency constraints. The reported compression takes the OpenVLA-OFT backbone from 14.3 GB to 2.7 GB while maintaining high task success rates. Consider integrating ActQuant and OmniModel.cpp to run sub-4-bit inference directly on your robotic platforms, improving performance and reducing energy consumption.

Key insights

ActQuant enables sub-4-bit quantization of VLA models through action-guided mixed-precision bit allocation, paired with efficient C/C++ deployment.

Principles

Quantize based on action contribution.
Optimize scales with action-aware curvature.
Preserve hardware-friendly uniform bit-widths.
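
A minimal sketch of the second principle, under stated assumptions: each block of weights gets one scale, chosen by grid search to minimize a curvature-weighted squared quantization error. The function names, the `curvature` array (standing in for some per-weight action sensitivity), and the candidate grid are hypothetical choices for illustration and are not taken from the paper.

```python
import numpy as np

def quantize_block(w, scale, bits):
    """Symmetric uniform quantization of one block at a given scale."""
    qmax = 2 ** (bits - 1) - 1
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

def search_block_scale(w, curvature, bits, num_candidates=64):
    """Pick the scale that minimizes curvature-weighted squared error.

    `curvature` is a hypothetical per-weight sensitivity (assumption of
    this sketch); larger values make errors on that weight cost more.
    """
    qmax = 2 ** (bits - 1) - 1
    base = np.max(np.abs(w)) / qmax  # plain absmax scale
    if base == 0.0:
        return 1.0  # all-zero block: any scale reproduces it exactly
    best_scale, best_err = base, np.inf
    # Try scales from half the absmax scale up to the absmax scale itself.
    for frac in np.linspace(0.5, 1.0, num_candidates):
        scale = base * frac
        err = np.sum(curvature * (w - quantize_block(w, scale, bits)) ** 2)
        if err < best_err:
            best_scale, best_err = scale, err
    return best_scale
```

Because the plain absmax scale is one of the candidates, the searched scale can never do worse than it on the weighted error. Every weight in the block still shares one bit-width, which is consistent with the third principle.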

Method

ActQuant uses the Hilbert-Schmidt Independence Criterion (HSIC) for inter-tensor bit allocation and an Action-Mixed Fisher for intra-tensor scale optimization. OmniModel.cpp converts PyTorch VLA models to native C/C++ with GGML kernels.
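
To make the inter-tensor step concrete, here is a rough sketch of how a dependence score could drive bit allocation. It computes the standard biased empirical HSIC estimator with Gaussian kernels between a tensor's features and the actions, then greedily gives higher bit-widths to higher-scoring tensors under an average bits-per-weight budget. The greedy rule, the bit-width choices, and all function names are assumptions for illustration; the paper's actual allocator may differ.

```python
import numpy as np

def rbf_kernel(x, sigma=None):
    """Gaussian kernel matrix; bandwidth defaults to the median heuristic."""
    sq = np.sum(x ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * x @ x.T, 0.0)
    if sigma is None:
        med = np.median(d2[d2 > 0]) if np.any(d2 > 0) else 1.0
        sigma = np.sqrt(0.5 * med)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def hsic(x, y):
    """Biased empirical HSIC: trace(K H L H) / (n - 1)^2."""
    n = x.shape[0]
    h = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    k, l = rbf_kernel(x), rbf_kernel(y)
    return float(np.trace(k @ h @ l @ h) / (n - 1) ** 2)

def allocate_bits(scores, sizes, budget_bpw, choices=(2, 3, 4)):
    """Hypothetical greedy allocator: start every tensor at the lowest
    width, then upgrade the highest-scoring tensors while the
    size-weighted average bit-width stays within the budget."""
    bits = {name: min(choices) for name in scores}
    total = sum(sizes.values())
    for name in sorted(scores, key=scores.get, reverse=True):
        for b in sorted(choices, reverse=True):
            trial = dict(bits, **{name: b})
            avg = sum(trial[t] * sizes[t] for t in trial) / total
            if avg <= budget_bpw:
                bits[name] = b
                break
    return bits
```

In use, `scores` would hold one HSIC value per tensor, computed from `(n_samples, n_features)` activations and the matching `(n_samples, action_dim)` actions on a calibration set. Each tensor ends up with a single bit-width, so the resulting layout stays uniform within a tensor.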

In practice

Deploy VLA models on edge devices.
Reduce VLA memory footprint by 5.3×.
Achieve 1.5× inference speedup on GPU.
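
The memory figure above follows directly from the reported sizes, and the average bits per weight of a mixed-precision model is a size-weighted mean. The short sketch below shows both calculations; the tensor names and parameter counts in the usage note are hypothetical placeholders, not values from the paper.

```python
def compression_ratio(original_gb, compressed_gb):
    """How many times smaller the compressed model is."""
    return original_gb / compressed_gb

def average_bpw(bits_by_tensor, params_by_tensor):
    """Size-weighted average bit-width across tensors."""
    total = sum(params_by_tensor.values())
    weighted = sum(bits_by_tensor[n] * params_by_tensor[n] for n in bits_by_tensor)
    return weighted / total

# Reported OpenVLA-OFT backbone sizes: 14.3 GB before, 2.7 GB after.
ratio = compression_ratio(14.3, 2.7)
print(f"{ratio:.1f}x")  # prints "5.3x"
```

For example, `average_bpw({"attn": 2, "mlp": 3}, {"attn": 100, "mlp": 100})` returns 2.5: half the weights at 2 bits and half at 3 bits average to 2.5 bpw.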

Topics

Vision-Language-Action Models
Post-Training Quantization
Mixed-Precision Quantization
Edge AI Deployment
Robotics Manipulation
GGML

Best for: AI Engineer, Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, Robotics Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.