Mix-QVLA: Task-Evidence-Aware Mixed-Precision Quantization of Vision-Language-Action Models

2026-06-17 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

Mix-QVLA is a task-evidence-aware mixed-precision Post-Training Quantization (PTQ) framework for Vision-Language-Action (VLA) models. It anchors each quantized variant to a full-precision action-token reference decision and measures how well quantization preserves task-relevant evidence across the VLA's functional boundaries. To do this, it computes normalized gradient-weighted task-evidence maps from boundary activations and compares the full-precision and quantized maps using two measures, evidence-mass distortion and attribution-distribution distortion, which together capture changes in decision-supporting evidence. A soft-bottleneck objective aggregates boundary-level degradation into layer-wise sensitivity scores. Sensitivity is modeled throughout task execution, so the scores capture phase-dependent shifts in layer importance. These evidence- and time-aware scores then guide mixed-precision bit allocation under model-size and BitOps budgets. In evaluations on OpenVLA-style policies, specifically OpenVLA-OFT on LIBERO, Mix-QVLA reduces memory from 15.4 GB to 4.1 GB, retains a 96.3% average success rate compared to 97.1% for the BF16 model, and achieves a 1.52x inference speedup.
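
The summary does not give the form of the soft-bottleneck objective, so the sketch below is only one plausible reading, not the paper's formulation. It assumes a temperature-scaled log-mean-exp (a smooth maximum) over boundaries at each timestep, then takes the worst phase average as the layer's sensitivity. The function names, the temperature value, and the phase labels are all hypothetical.

```python
import numpy as np

def soft_bottleneck(degradation, temperature=0.1):
    """Smooth maximum over boundaries for each timestep (assumed form).

    degradation: array of shape (timesteps, boundaries), non-negative.
    Returns one value per timestep, lying between the row mean and row max.
    """
    d = np.asarray(degradation, dtype=float)
    z = d / temperature
    z_max = z.max(axis=1, keepdims=True)  # stabilise the exponentials
    lse = z_max[:, 0] + np.log(np.exp(z - z_max).sum(axis=1))
    # Subtracting log(B) turns log-sum-exp into log-mean-exp,
    # so a row of identical values maps to that value.
    return temperature * (lse - np.log(d.shape[1]))

def layer_sensitivity(degradation, phase_ids, temperature=0.1):
    """Worst phase-averaged soft-bottleneck score (assumed aggregation)."""
    per_step = soft_bottleneck(degradation, temperature)
    phase_ids = np.asarray(phase_ids)
    phase_means = [per_step[phase_ids == p].mean() for p in np.unique(phase_ids)]
    return float(max(phase_means))

# Toy layer: harmless while approaching (phase 0), damaging while grasping (phase 1).
toy_degradation = [[0.1, 0.1], [0.1, 0.1], [0.8, 0.8]]
toy_phases = [0, 0, 1]
score = layer_sensitivity(toy_degradation, toy_phases)
```

A plain average over all timesteps would rate this toy layer at about 0.33, while the phase-aware score reports 0.8. This illustrates why phase-dependent shifts in layer importance matter when sensitivity scores drive bit allocation.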

Key takeaway

For Machine Learning Engineers deploying Vision-Language-Action (VLA) models, Mix-QVLA shows that post-training mixed-precision quantization can cut memory use and latency with little loss in task success. If memory limits or slow inference are blocking deployment of models like OpenVLA-OFT, consider task-evidence-aware mixed-precision quantization. On LIBERO, the reported results are a roughly 73% smaller memory footprint (15.4 GB to 4.1 GB) and a 1.52x inference speedup, with an average success rate of 96.3% versus 97.1% for the BF16 model.
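
The headline figures can be checked with a few lines of arithmetic. The snippet uses only the numbers quoted above; the variable names are illustrative.

```python
# Figures quoted in the summary (OpenVLA-OFT on LIBERO).
bf16_gb = 15.4
quant_gb = 4.1

reduction = 1 - quant_gb / bf16_gb   # fraction of memory saved
compression = bf16_gb / quant_gb     # compression ratio
success_drop = 97.1 - 96.3           # percentage points of success lost

print(f"memory reduction:  {reduction:.1%}")             # 73.4%
print(f"compression ratio: {compression:.2f}x")          # 3.76x
print(f"success drop:      {success_drop:.1f} points")   # 0.8 points
```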

Key insights

Mix-QVLA quantizes VLA models while preserving task-relevant evidence across functional boundaries, improving the accuracy-efficiency trade-off.

Principles

Quantization must preserve task-relevant evidence.
Layer sensitivity shifts during task execution.
Mixed-precision allocation needs evidence- and time-awareness.
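
To make the last principle concrete, here is a minimal sketch of sensitivity-guided bit allocation. It is a generic greedy heuristic, not the paper's solver, and it enforces only a model-size budget. A BitOps budget could be added as a second constraint in the same loop. The layer names, sensitivity values, and candidate bit-widths are invented for illustration.

```python
def allocate_bits(sensitivity, num_params, size_budget_bits, choices=(8, 4, 2)):
    """Greedily lower bit-widths until the weights fit the size budget.

    sensitivity[layer][bits] -> degradation score (higher means more damage).
    num_params[layer]        -> number of weights in that layer.
    choices                  -> candidate bit-widths, highest first.
    """
    bits = {name: choices[0] for name in num_params}

    def total_bits():
        return sum(num_params[n] * bits[n] for n in bits)

    while total_bits() > size_budget_bits:
        best = None
        for name, current in bits.items():
            idx = choices.index(current)
            if idx + 1 == len(choices):
                continue  # already at the lowest bit-width
            lower = choices[idx + 1]
            added_damage = sensitivity[name][lower] - sensitivity[name][current]
            saved = num_params[name] * (current - lower)
            ratio = added_damage / saved  # damage per bit saved
            if best is None or ratio < best[0]:
                best = (ratio, name, lower)
        if best is None:
            raise ValueError("size budget cannot be met with the given choices")
        bits[best[1]] = best[2]
    return bits

# Hypothetical three-block model.
num_params = {"vision": 100, "language": 1000, "action": 10}
sensitivity = {
    "vision":   {8: 0.0, 4: 0.5, 2: 2.0},
    "language": {8: 0.0, 4: 0.1, 2: 1.0},
    "action":   {8: 0.0, 4: 0.9, 2: 3.0},
}
plan = allocate_bits(sensitivity, num_params, size_budget_bits=5000)
print(plan)  # {'vision': 8, 'language': 4, 'action': 8}
```

In this toy case the large, tolerant language block absorbs the compression, while the small, sensitive action block stays at 8 bits.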

Method

Mix-QVLA computes gradient-weighted task-evidence maps, compares full-precision and quantized maps for distortion, and aggregates degradation into layer-wise sensitivity scores to guide bit allocation.
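
The first two steps can be sketched in NumPy. The summary does not specify the exact formulas, so the sketch makes three assumptions: the evidence map is a Grad-CAM-style rectified sum of activation times gradient over channels, evidence-mass distortion is a relative change in total mass, and attribution-distribution distortion is a Jensen-Shannon divergence. All function names are hypothetical.

```python
import numpy as np

EPS = 1e-12

def evidence_map(activation, gradient):
    """Return (normalized evidence map, total evidence mass) for one boundary.

    activation, gradient: arrays of shape (tokens, channels). The gradient is
    that of the reference action-token score with respect to the activation.
    """
    # Keep only evidence that supports the reference decision (assumed ReLU).
    raw = np.maximum((activation * gradient).sum(axis=-1), 0.0)
    mass = float(raw.sum())
    return raw / (mass + EPS), mass

def _kl(a, b):
    """KL divergence restricted to entries where a > 0."""
    mask = a > 0
    return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

def evidence_distortion(act_fp, grad_fp, act_q, grad_q):
    """Return (evidence-mass distortion, attribution-distribution distortion)."""
    p, mass_fp = evidence_map(act_fp, grad_fp)
    q, mass_q = evidence_map(act_q, grad_q)
    mass_term = abs(mass_fp - mass_q) / (mass_fp + EPS)
    mid = 0.5 * (p + q)
    dist_term = 0.5 * _kl(p, mid) + 0.5 * _kl(q, mid)  # Jensen-Shannon divergence
    return mass_term, dist_term

# Synthetic boundary activations; rounding is a crude stand-in for quantization.
rng = np.random.default_rng(0)
act = rng.normal(size=(16, 8))
grad = rng.normal(size=(16, 8))
act_q = np.round(act * 2) / 2
mass_term, dist_term = evidence_distortion(act, grad, act_q, grad)
```

In a real pipeline the activations and gradients would come from forward and backward passes of the full-precision and quantized policies at each functional boundary. The resulting pairs of distortion terms would then feed the layer-wise sensitivity scores.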

In practice

Reduce VLA model memory footprint.
Improve VLA inference speed.
Deploy OpenVLA-style policies efficiently.

Topics

Vision-Language-Action Models
Mixed-Precision Quantization
Post-Training Quantization
Model Compression
Robotics Policies
OpenVLA

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, Robotics Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.