AVIS: Adaptive Test-Time Scaling for Vision-Language Models
Summary
AVIS, an Adaptive Visual Inference Scaling policy, addresses the high inference cost of modern Vision-Language Models (VLMs), which arises from large visual contexts and long decoding chains. Unlike existing methods that optimize Visual Context Scaling (VCS) or Visual Reasoning Scaling (VRS) independently, AVIS adaptively scales both axes per query. For VCS, AVIS employs Key Diversity Visual (KDV) pruning, a training-free O(N) key-based rule that removes redundant visual tokens before prefilling. For VRS, it uses adaptive self-consistency, in which a learned difficulty predictor determines the number of reasoning rollouts for each query. AVIS is designed for deployment: it supports shared-prefill inference, where all rollouts reuse a single prefilling pass and its KV cache. Benchmarks across diverse image and video reasoning tasks show that AVIS improves the accuracy-compute trade-off over VCS-only and VRS-only baselines, and that it remains effective with RL post-trained VLMs while reducing compute and latency.
Key takeaway
For Machine Learning Engineers optimizing Vision-Language Model inference, AVIS offers a way to reduce compute and latency. If your costs are dominated by large visual contexts or long decoding chains, consider scaling both adaptively per query: prune redundant visual tokens before prefilling, and choose the number of reasoning rollouts based on predicted query difficulty. The reported result is a better accuracy-compute trade-off than scaling either axis alone, including with RL post-trained VLMs.
Key insights
AVIS adaptively scales both visual context and reasoning per query, improving the accuracy-compute trade-off of VLM inference over methods that scale only one of the two.
Principles
- Inference cost in VLMs stems from visual context and reasoning search.
- Jointly optimizing visual context and reasoning improves efficiency.
- Training-free pruning can reduce visual token redundancy.
Method
AVIS uses Key Diversity Visual (KDV) pruning for Visual Context Scaling (VCS) and adaptive self-consistency with a learned difficulty predictor for Visual Reasoning Scaling (VRS).
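To make the idea of a training-free, key-based pruning rule concrete, here is a minimal sketch. The summary does not give the actual KDV scoring rule, so the score used here (cosine similarity of each token's key to the mean key, keeping the least similar tokens) is an assumption chosen only for illustration, and the name `kdv_prune_sketch` is hypothetical. Scoring is a single O(N) pass over the tokens; the final selection uses a sort for brevity.

```python
import math


def kdv_prune_sketch(keys, keep):
    """Return sorted indices of the `keep` visual tokens to retain.

    ASSUMPTION: this is an illustrative diversity score, not the
    published KDV rule. A token is treated as redundant when its key
    vector points in nearly the same direction as the mean key.
    """
    n = len(keys)
    if keep >= n:
        return list(range(n))
    dim = len(keys[0])

    # Mean key over all visual tokens: one pass over the tokens.
    mean = [sum(k[d] for k in keys) / n for d in range(dim)]
    mean_norm = math.sqrt(sum(m * m for m in mean)) or 1.0

    def cosine_to_mean(k):
        norm = math.sqrt(sum(x * x for x in k)) or 1.0
        return sum(x * m for x, m in zip(k, mean)) / (norm * mean_norm)

    # One score per token, again a single pass.
    scores = [cosine_to_mean(k) for k in keys]

    # Lowest similarity first, so the most distinctive tokens are kept.
    # sorted() is O(N log N); a linear-time selection could replace it.
    ranked = sorted(range(n), key=lambda i: scores[i])

    # Restore the original token order for the retained indices.
    return sorted(ranked[:max(keep, 0)])
```

In a real VLM the keys would come from an attention layer and pruning would be applied before prefilling, so that the language model never processes the dropped tokens.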
In practice
- Implement KDV pruning for efficient visual token handling.
- Use adaptive self-consistency to manage reasoning rollouts.
- Deploy AVIS with shared-prefill inference for VLM efficiency.
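The last two points can be sketched together: a difficulty score sets the rollout budget, one prefilling pass builds the cache, and every rollout decodes from that same cache before a majority vote. This is a sketch under stated assumptions, not the AVIS implementation. The callables `prefill`, `decode`, and `predict_difficulty` are hypothetical stand-ins for a model's APIs, and the linear mapping from difficulty to rollout count is an assumption; the summary does not say how the predictor's output is converted into a budget.

```python
from collections import Counter


def rollouts_for(difficulty, min_rollouts=1, max_rollouts=8):
    """Map a predicted difficulty in [0, 1] to a rollout budget.

    ASSUMPTION: a linear mapping, chosen for illustration only.
    """
    d = min(max(difficulty, 0.0), 1.0)  # clamp out-of-range predictions
    return min_rollouts + round(d * (max_rollouts - min_rollouts))


def adaptive_self_consistency(prefill, decode, predict_difficulty, query):
    """Run a difficulty-dependent number of rollouts over one shared cache.

    prefill(query) -> cache            (hypothetical, called exactly once)
    decode(cache, seed) -> answer      (hypothetical, one sampled rollout)
    predict_difficulty(query) -> float (hypothetical learned predictor)
    """
    cache = prefill(query)  # single prefilling pass
    n = rollouts_for(predict_difficulty(query))

    # Every rollout reuses the same cache; only the sampling seed differs.
    answers = [decode(cache, seed) for seed in range(n)]

    # Self-consistency: return the most frequent final answer.
    answer, _count = Counter(answers).most_common(1)[0]
    return answer, n
```

Easy queries end up with a single rollout, which costs the same as ordinary greedy or sampled decoding, while hard queries spend more decoding compute without repeating the prefill.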
Topics
- Vision-Language Models
- Inference Optimization
- Adaptive Scaling
- Visual Context Scaling
- Visual Reasoning
- KDV Pruning
- Self-Consistency
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.