AVIS: Adaptive Test-Time Scaling for Vision-Language Models

Source: Artificial Intelligence · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems) · Depth: Expert, quick

Summary

AVIS, an Adaptive Visual Inference Scaling policy, targets the high inference cost of modern Vision-Language Models (VLMs), which comes from large visual contexts and long decoding chains. Existing methods optimize Visual Context Scaling (VCS) or Visual Reasoning Scaling (VRS) independently; AVIS scales both axes adaptively for each query. For VCS, AVIS uses Key Diversity Visual (KDV) pruning, a training-free, O(N) key-based rule that removes redundant visual tokens before prefilling. For VRS, it uses adaptive self-consistency, in which a learned difficulty predictor sets the number of reasoning rollouts. AVIS is designed for deployment: it supports shared-prefill inference, where all rollouts reuse a single prefilling pass and its KV cache. Benchmarks across diverse image and video reasoning tasks show that AVIS improves the accuracy-compute trade-off over VCS-only and VRS-only baselines, and that it remains effective with RL post-trained VLMs while reducing compute and latency.
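
To make the idea of a training-free, O(N) key-based pruning rule concrete, here is a minimal sketch. It is not the KDV rule itself, which this summary does not specify. It assumes a simple stand-in: walk the visual tokens once and drop any token whose attention key is nearly parallel to the key of the most recently kept token. The function name `prune_redundant_keys` and the threshold `tau` are hypothetical.

```python
import math


def _cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def prune_redundant_keys(keys, tau=0.95):
    """Return indices of visual tokens to keep before prefilling.

    Assumed stand-in rule (NOT the published KDV rule): a token is dropped
    when its key has cosine similarity >= tau with the last kept key.
    Each token is compared once, so the pass is O(N) in the token count.
    """
    kept = []
    for i, key in enumerate(keys):
        if not kept or _cosine(key, keys[kept[-1]]) < tau:
            kept.append(i)
    return kept


# Toy keys: tokens 1 and 3 nearly duplicate the token before them.
toy_keys = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
kept_indices = prune_redundant_keys(toy_keys)
```

With these toy keys, `kept_indices` is `[0, 2, 4]`: the first token of each run of near-identical keys survives. Only the kept tokens would then be passed to the prefilling step.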

Key takeaway

For machine learning engineers optimizing Vision-Language Model inference, AVIS offers a way to reduce compute and latency. If large visual contexts or long decoding chains drive your costs, consider scaling both visual context and reasoning adaptively per query, as AVIS does. The method improves the accuracy-compute trade-off, including for RL post-trained VLMs, by pruning redundant visual tokens and matching the number of reasoning rollouts to query difficulty.

Key insights

AVIS adaptively scales visual context and reasoning per query, improving the accuracy-compute trade-off of VLM inference over VCS-only and VRS-only baselines.

Method

AVIS uses Key Diversity Visual (KDV) pruning for Visual Context Scaling (VCS) and adaptive self-consistency with a learned difficulty predictor for Visual Reasoning Scaling (VRS).
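
The adaptive self-consistency side can be sketched in a few lines. This is an illustration under stated assumptions, not the AVIS implementation: the linear mapping from difficulty to rollout count, the bounds of 1 and 8 rollouts, and the callables `predict_difficulty`, `prefill`, and `decode` are all hypothetical placeholders for a learned predictor and a real VLM runtime. The one structural point taken from the summary is that prefilling runs once and every rollout reuses its result.

```python
from collections import Counter


def rollout_budget(difficulty, min_rollouts=1, max_rollouts=8):
    """Map a difficulty score in [0, 1] to a rollout count (assumed linear rule)."""
    d = min(max(difficulty, 0.0), 1.0)
    return min_rollouts + round(d * (max_rollouts - min_rollouts))


def adaptive_self_consistency(query, predict_difficulty, prefill, decode,
                              min_rollouts=1, max_rollouts=8):
    """Sample a difficulty-dependent number of answers and majority-vote.

    predict_difficulty(query) -> float   hypothetical learned predictor
    prefill(query) -> state              run once; stands in for the KV cache
    decode(state, seed) -> answer        one reasoning rollout
    """
    n = rollout_budget(predict_difficulty(query), min_rollouts, max_rollouts)
    state = prefill(query)  # shared prefill: computed once for all rollouts
    answers = [decode(state, seed) for seed in range(n)]
    winner, _ = Counter(answers).most_common(1)[0]
    return winner, n


# Toy stand-ins so the sketch runs without a model.
prefill_calls = []


def toy_prefill(query):
    prefill_calls.append(query)
    return {"query": query}


def toy_decode(state, seed):
    # Two of every three rollouts answer "A", the rest answer "B".
    return "A" if seed % 3 else "B"


answer, used = adaptive_self_consistency(
    "hard question", lambda q: 1.0, toy_prefill, toy_decode
)
```

Here the toy predictor reports maximum difficulty, so eight rollouts are sampled, `toy_prefill` is called exactly once, and the majority answer is `"A"`. An easy query with difficulty 0.0 would use a single rollout and skip the vote in effect.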

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.