AVIS: Adaptive Test-Time Scaling for Vision-Language Models

2026-06-10 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

AVIS (Adaptive Visual Inference Scaling) is a policy that targets the high inference cost of modern Vision-Language Models (VLMs), which arises from large visual contexts and long decoding chains. Unlike existing methods that optimize Visual Context Scaling (VCS) or Visual Reasoning Scaling (VRS) independently, AVIS adaptively scales both axes per query. For VCS, AVIS employs Key Diversity Visual (KDV) pruning, a training-free O(N) key-based rule that removes redundant visual tokens before prefilling. For VRS, it uses adaptive self-consistency, in which a learned difficulty predictor determines the number of reasoning rollouts. AVIS is designed for deployment: it supports shared-prefill inference, where all rollouts reuse a single prefilling pass and KV cache. Benchmarks across diverse image and video reasoning tasks show that AVIS improves the accuracy-compute trade-off over VCS-only and VRS-only baselines, and that it remains effective with RL post-trained VLMs while reducing compute and latency.

Key takeaway

For machine learning engineers optimizing Vision-Language Model inference, AVIS offers a way to cut compute and latency. If your costs are driven by large visual contexts or long decoding chains, consider scaling both the visual context and the number of reasoning rollouts per query rather than tuning either one alone. According to the source, this improves the accuracy-compute trade-off, even with RL post-trained VLMs, by pruning redundant visual tokens and allocating rollouts according to predicted query difficulty.

Key insights

AVIS adaptively scales visual context and reasoning per query, improving the accuracy-compute trade-off over VCS-only and VRS-only baselines.

Principles

Inference cost in VLMs stems from visual context and reasoning search.
Jointly optimizing visual context and reasoning improves efficiency.
Training-free pruning can reduce visual token redundancy.
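
To make the two cost axes concrete, here is a toy cost model in Python. It is an illustrative sketch, not something taken from the source: the `CostModel` class, its unit costs, and the linear cost assumption are all hypothetical (real prefill cost grows faster than linearly with context length, and decode cost also depends on context size).

```python
from dataclasses import dataclass


@dataclass
class CostModel:
    """Toy per-query cost model; unit costs are illustrative, not measured."""

    prefill_cost_per_token: float = 1.0  # assumed cost to prefill one context token
    decode_cost_per_token: float = 1.0   # assumed cost to decode one output token

    def query_cost(self, visual_tokens, text_tokens, rollouts,
                   decode_tokens_per_rollout, shared_prefill=True):
        # Visual-context axis: every context token must be prefilled.
        prefill = (visual_tokens + text_tokens) * self.prefill_cost_per_token
        # Reasoning axis: each rollout decodes its own chain.
        decode = rollouts * decode_tokens_per_rollout * self.decode_cost_per_token
        # With shared prefill, all rollouts reuse one prefilling pass.
        prefill_passes = 1 if shared_prefill else rollouts
        return prefill_passes * prefill + decode


model = CostModel()
baseline = model.query_cost(1000, 100, 4, 50, shared_prefill=False)
shared = model.query_cost(1000, 100, 4, 50)
pruned = model.query_cost(250, 100, 4, 50)
print(baseline, shared, pruned)
```

Under these made-up unit costs, sharing the prefill pass and pruning visual tokens each lower the total, and the two savings compound. That is the intuition behind scaling both axes together.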

Method

AVIS uses Key Diversity Visual (KDV) pruning for Visual Context Scaling (VCS) and adaptive self-consistency with a learned difficulty predictor for Visual Reasoning Scaling (VRS).
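
The summary does not spell out the KDV rule, so the sketch below is a stand-in heuristic rather than the authors' method. It assumes each visual token has a key vector, scores every key by cosine similarity to the mean key in a single pass, and keeps the tokens that are least similar to that mean. The function name `prune_by_key_diversity` and the scoring rule are assumptions made for illustration.

```python
import heapq
import math


def prune_by_key_diversity(keys, keep):
    """Return sorted indices of the `keep` tokens least similar to the mean key.

    Hypothetical stand-in for a key-based diversity rule; not the KDV rule itself.
    """
    if keep <= 0:
        return []
    if keep >= len(keys):
        return list(range(len(keys)))

    dim = len(keys[0])
    # Mean key over all visual tokens (one pass over the tokens).
    mean = [sum(k[d] for k in keys) / len(keys) for d in range(dim)]
    mean_norm = math.sqrt(sum(m * m for m in mean)) or 1.0

    def similarity(key):
        # Cosine similarity to the mean key; zero vectors fall back to norm 1.
        norm = math.sqrt(sum(x * x for x in key)) or 1.0
        return sum(x * m for x, m in zip(key, mean)) / (norm * mean_norm)

    scores = [similarity(k) for k in keys]
    # Keep the tokens that look least like the average (most distinctive).
    kept = heapq.nsmallest(keep, range(len(keys)), key=scores.__getitem__)
    return sorted(kept)  # preserve the original token order


keys = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(prune_by_key_diversity(keys, keep=2))
```

Scoring is linear in the number of tokens; the top-k selection in this sketch adds a logarithmic factor, which a threshold-based rule would avoid. The pruned indices would be applied before prefilling, so that dropped tokens never enter the KV cache.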

In practice

Implement KDV pruning for efficient visual token handling.
Use adaptive self-consistency to manage reasoning rollouts.
Deploy AVIS with shared-prefill inference for VLM efficiency.
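
A minimal sketch of adaptive self-consistency, assuming the difficulty predictor returns a score in [0, 1] and that answers can be compared by simple equality. The names `rollout_budget`, `adaptive_self_consistency`, `predict_difficulty`, and `sample_answer`, as well as the linear mapping from difficulty to rollout count, are hypothetical and not taken from the source.

```python
from collections import Counter


def rollout_budget(difficulty, min_rollouts=1, max_rollouts=8):
    """Map a difficulty score in [0, 1] to a number of reasoning rollouts."""
    difficulty = min(1.0, max(0.0, difficulty))  # clamp out-of-range scores
    return min_rollouts + round(difficulty * (max_rollouts - min_rollouts))


def adaptive_self_consistency(query, predict_difficulty, sample_answer,
                              min_rollouts=1, max_rollouts=8):
    """Sample a difficulty-dependent number of answers and majority-vote them."""
    n = rollout_budget(predict_difficulty(query), min_rollouts, max_rollouts)
    # In a shared-prefill deployment, each call would reuse one prefilled KV cache.
    answers = [sample_answer(query) for _ in range(n)]
    winner, _count = Counter(answers).most_common(1)[0]
    return winner, n


# Usage with stub callables standing in for the predictor and the VLM.
answer, used = adaptive_self_consistency(
    "How many chairs are in the image?",
    predict_difficulty=lambda q: 0.0,  # an "easy" query gets a single rollout
    sample_answer=lambda q: "4",
)
print(answer, used)
```

Easy queries exit after a single rollout, while hard ones receive the full budget, so the average number of decoding chains tracks predicted difficulty instead of being fixed in advance.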

Topics

Vision-Language Models
Inference Optimization
Adaptive Scaling
Visual Context Scaling
Visual Reasoning
KDV Pruning
Self-Consistency

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.