Mix-QVLA: Task-Evidence-Aware Mixed-Precision Quantization of Vision-Language-Action Models
Summary
Mix-QVLA is a task-evidence-aware, mixed-precision Post-Training Quantization (PTQ) framework for Vision-Language-Action (VLA) models. It anchors each quantized variant to a full-precision action-token reference decision and evaluates how well quantization preserves task-relevant evidence across VLA functional boundaries. To do so, it computes normalized gradient-weighted task-evidence maps from boundary activations and compares the full-precision and quantized maps using two measures, evidence-mass distortion and attribution-distribution distortion, which together capture changes in decision-supporting evidence. A soft-bottleneck objective aggregates boundary-level degradation into layer-wise sensitivity scores, and sensitivity is modeled throughout task execution to capture phase-dependent shifts in layer importance. These evidence- and time-aware scores then guide mixed-precision bit allocation under model-size and BitOps budgets. In evaluations on OpenVLA-style policies, specifically OpenVLA-OFT on LIBERO, Mix-QVLA reduces memory from 15.4 GB to 4.1 GB, retains a 96.3% average success rate compared to 97.1% for the BF16 model, and achieves a 1.52x inference speedup.
Key takeaway
For machine learning engineers deploying Vision-Language-Action (VLA) models, Mix-QVLA offers a way to cut memory and latency with little loss in task success. If you are constrained by memory or slow inference with models like OpenVLA-OFT, consider task-evidence-aware mixed-precision quantization. In the reported OpenVLA-OFT experiments on LIBERO, it reduces the memory footprint by over 70% (from 15.4 GB to 4.1 GB) and speeds up inference by 1.52x, while retaining a 96.3% average success rate versus 97.1% for the BF16 model.
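The headline figures can be checked with a few lines of arithmetic. The snippet below uses only the numbers quoted in this summary; the variable names are illustrative.

```python
# Figures quoted in the summary (OpenVLA-OFT on LIBERO).
bf16_gb, quantized_gb = 15.4, 4.1
bf16_success, quantized_success = 97.1, 96.3

memory_reduction = (bf16_gb - quantized_gb) / bf16_gb   # fraction of memory saved
compression_ratio = bf16_gb / quantized_gb              # how many times smaller
success_drop = round(bf16_success - quantized_success, 1)  # percentage points

print(f"memory reduction: {memory_reduction:.1%}")      # 73.4%
print(f"compression ratio: {compression_ratio:.2f}x")   # 3.76x
print(f"success drop: {success_drop} points")           # 0.8 points
```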
Key insights
Mix-QVLA quantizes VLA models while preserving task-relevant evidence across functional boundaries, improving the trade-off between accuracy and efficiency.
Principles
- Quantization must preserve task-relevant evidence.
- Layer sensitivity shifts during task execution.
- Mixed-precision allocation needs evidence- and time-awareness.
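As a rough illustration of these principles, the sketch below allocates bit-widths from per-phase sensitivity scores under a model-size budget. It is not the paper's algorithm: the greedy rule, the use of the maximum over phases as the time-aware score, and all names (`allocate_bits`, `phase_scores`, `num_params`, `budget_bits`) are assumptions made for this example, and the BitOps budget is left out for brevity.

```python
def allocate_bits(sensitivity, num_params, budget_bits, choices=(8, 4, 2)):
    """Greedy sketch (assumed rule, not the paper's): start every layer at the
    highest bit-width, then repeatedly lower the layer whose demotion costs the
    least sensitivity per bit saved, until the total size fits the budget."""
    choices = sorted(choices, reverse=True)
    level = {name: 0 for name in sensitivity}  # index into `choices` per layer

    def total_bits():
        return sum(num_params[n] * choices[level[n]] for n in level)

    while total_bits() > budget_bits:
        best, best_cost = None, None
        for name, idx in level.items():
            if idx + 1 >= len(choices):
                continue  # already at the lowest allowed precision
            saved = num_params[name] * (choices[idx] - choices[idx + 1])
            cost = sensitivity[name] / saved
            if best_cost is None or cost < best_cost:
                best, best_cost = name, cost
        if best is None:
            raise ValueError("budget cannot be met with the given bit-widths")
        level[best] += 1
    return {name: choices[level[name]] for name in level}


# Hypothetical sensitivity scores for two task phases (e.g. reach, grasp).
phase_scores = {
    "vision": [0.9, 0.3],
    "language": [0.1, 0.2],
    "action_head": [0.4, 1.5],
}
# Time-aware score: a layer is as sensitive as its most sensitive phase (assumption).
sensitivity = {name: max(scores) for name, scores in phase_scores.items()}
num_params = {"vision": 100, "language": 400, "action_head": 10}

bits = allocate_bits(sensitivity, num_params, budget_bits=2600)
print(bits)
```

In this toy setup the large, low-sensitivity `language` block is demoted first, while the small `action_head`, which becomes critical only in the second phase, stays at high precision. Averaging over phases instead of taking the maximum would hide that late-phase spike.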
Method
Mix-QVLA computes gradient-weighted task-evidence maps, compares full-precision and quantized maps for distortion, and aggregates degradation into layer-wise sensitivity scores to guide bit allocation.
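The pipeline described above can be sketched in plain Python. The summary does not give the paper's formulas, so every formula here is an assumption: the evidence map uses a Grad-CAM-like ReLU(activation × gradient), evidence-mass distortion is a relative change in total evidence, attribution-distribution distortion is a Jensen-Shannon divergence, and the soft bottleneck is a log-sum-exp smooth maximum. All function names are hypothetical.

```python
import math

def evidence_map(activations, gradients):
    """Gradient-weighted evidence per element: ReLU(activation * gradient).
    Assumed Grad-CAM-like form, not the paper's definition."""
    return [max(a * g, 0.0) for a, g in zip(activations, gradients)]

def normalize(values, eps=1e-12):
    """Scale a non-negative map so that it sums to (almost exactly) one."""
    total = sum(values) + eps
    return [v / total for v in values]

def mass_distortion(fp_map, q_map, eps=1e-12):
    """Relative change in total evidence mass after quantization."""
    fp_mass = sum(fp_map)
    return abs(sum(q_map) - fp_mass) / (fp_mass + eps)

def distribution_distortion(fp_map, q_map):
    """Jensen-Shannon divergence between the two normalized maps."""
    p, q = normalize(fp_map), normalize(q_map)
    m = [(a + b) / 2.0 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0.0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def soft_bottleneck(degradations, temperature=0.05):
    """Smooth maximum (log-sum-exp) over boundary-level degradations, so the
    worst boundary dominates the layer's sensitivity score."""
    peak = max(degradations)
    total = sum(math.exp((d - peak) / temperature) for d in degradations)
    return peak + temperature * math.log(total)


# Toy boundary: the same gradients are reused for both models to keep the
# example small; in practice each model would supply its own gradients.
grads = [0.5, 0.25, -1.0, 2.0]
fp_map = evidence_map([1.0, 2.0, 0.5, 0.0], grads)  # full-precision activations
q_map = evidence_map([1.0, 1.0, 0.5, 0.0], grads)   # quantized activations

degradation = mass_distortion(fp_map, q_map) + distribution_distortion(fp_map, q_map)
layer_score = soft_bottleneck([0.1, degradation, 0.2])
print(degradation, layer_score)
```

A smooth maximum is used rather than a mean so that a single badly degraded boundary is not averaged away; lowering `temperature` moves the score closer to the hard maximum.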
In practice
- Reduce VLA model memory footprint.
- Improve VLA inference speed.
- Deploy OpenVLA-style policies efficiently.
Topics
- Vision-Language-Action Models
- Mixed-Precision Quantization
- Post-Training Quantization
- Model Compression
- Robotics Policies
- OpenVLA
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, Robotics Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.