Learning Adaptive Reasoning Paths for Efficient Visual Reasoning

· Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision, Natural Language Processing · Depth: Expert, quick

Summary

Visual reasoning models (VRMs) frequently generate excessively long reasoning chains, a problem termed "Reasoning Path Redundancy." AVR (Adaptive Visual Reasoning) is a framework designed to mitigate it. AVR decomposes visual reasoning into three core cognitive functions: visual perception, logical reasoning, and answer application. Based on task complexity, the model dynamically selects one of three response formats: Full Format, Perception-Only Format, or Direct Answer. AVR is trained with FS-GRPO, a modified form of Group Relative Policy Optimization that optimizes for reasoning efficiency without sacrificing accuracy. Evaluations on several vision-language benchmarks show that AVR reduces token usage by 50-90% while maintaining accuracy, especially on perception-heavy tasks.
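To make the three response formats concrete, here is a minimal Python sketch. The mapping from each format to the stages it emits (Full Format uses all three, Perception-Only skips logical reasoning, Direct Answer emits only the answer) is an assumption inferred from the format names. The tag syntax and the names `Stage`, `FORMAT_STAGES`, and `render_response` are hypothetical and are not taken from the AVR code.

```python
from enum import Enum


class Stage(Enum):
    PERCEPTION = "perception"
    REASONING = "reasoning"
    ANSWER = "answer"


# ASSUMPTION: which stages each format includes is inferred from the
# format names in the summary, not confirmed against the AVR release.
FORMAT_STAGES = {
    "full": (Stage.PERCEPTION, Stage.REASONING, Stage.ANSWER),
    "perception_only": (Stage.PERCEPTION, Stage.ANSWER),
    "direct_answer": (Stage.ANSWER,),
}


def render_response(fmt, segments):
    """Concatenate only the stages that the chosen format includes.

    `segments` maps each Stage to its text; the <tag> wrapper is a
    hypothetical serialization chosen for illustration.
    """
    return "\n".join(
        f"<{stage.value}>{segments[stage]}</{stage.value}>"
        for stage in FORMAT_STAGES[fmt]
    )


segments = {
    Stage.PERCEPTION: "The image shows four red apples on a table.",
    Stage.REASONING: "The question asks for the count of apples; all four are visible.",
    Stage.ANSWER: "4",
}

full = render_response("full", segments)
direct = render_response("direct_answer", segments)
```

Under these assumptions, a simple counting question answered in the Direct Answer format produces only the answer segment, which is where the token savings would come from.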

Key takeaway

For AI engineers developing visual reasoning models, adaptive reasoning frameworks such as AVR are worth considering. The reported 50-90% reduction in token usage comes without compromising accuracy, especially on perception-heavy tasks, which makes deployments more efficient and cost-effective. Explore the provided code and data to adapt AVR's principles to your specific VRM architectures.

Key insights

By dynamically selecting a reasoning path suited to each task, adaptive visual reasoning reduces token usage in VRMs by 50-90% in the reported evaluations while maintaining accuracy.

Method

AVR decomposes visual reasoning into perception, logical reasoning, and answer application, then uses FS-GRPO to train models to dynamically select among Full, Perception-Only, or Direct Answer formats for efficiency.
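The training signal can be illustrated with a small sketch. The group-relative advantage below is the standard GRPO normalization (each sampled response's reward minus the group mean, divided by the group standard deviation). The format-dependent bonus is only a hypothetical stand-in for whatever FS-GRPO actually changes, which this summary does not specify; the values in `FORMAT_BONUS` and the name `shaped_reward` are assumptions chosen for illustration.

```python
from statistics import mean, pstdev

# HYPOTHETICAL bonus values: shorter formats earn slightly more, but only
# when the answer is correct. Not taken from the FS-GRPO paper or code.
FORMAT_BONUS = {"full": 0.0, "perception_only": 0.1, "direct_answer": 0.2}


def shaped_reward(correct, fmt):
    """Reward correctness first; add the format bonus only to correct answers."""
    if not correct:
        return 0.0
    return 1.0 + FORMAT_BONUS[fmt]


def group_relative_advantages(rewards, eps=1e-6):
    """Standard GRPO-style normalization within one group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; a sample std is also common
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, three sampled responses: (is_correct, chosen_format)
group = [(True, "full"), (True, "direct_answer"), (False, "direct_answer")]
rewards = [shaped_reward(c, f) for c, f in group]
advantages = group_relative_advantages(rewards)
```

With this shaping, a correct Direct Answer receives a higher advantage than a correct Full Format response, while an incorrect short answer is penalized relative to both. That is one way to encourage efficiency without rewarding wrong shortcuts.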

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.