Understanding Pruning Regimes in Vision-Language Models Through Domain-Aware Layer Selection

2026-03-24 · Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Expert, extended

Summary

A study by Saeed Khaki, Nima Safaei, and Kamal Ginotra investigates structured decoder layer pruning in Transformer-based Vision-Language Models (VLMs), specifically Qwen3-VL-2B-Instruct and Qwen3-VL-4B-Instruct. The goal is to reduce depth redundancy while preserving domain-specific capabilities such as mathematical reasoning. The researchers introduce a "domain-aware activation similarity" method, which measures how much each decoder layer transforms representations for math versus non-math inputs. Layers that transform a domain's representations the least are treated as least critical to that domain, which yields three ranking criteria: math-aware, non-math-aware, and mixed. Experiments across math and general multimodal benchmarks reveal a consistent three-regime pruning structure: at low budgets, results are highly sensitive to which layers are chosen; at moderate budgets, the methods converge because structural damage dominates; and at high budgets, spacing-aware strategies are favored. Domain-aware rankings are the most stable in the ranking-sensitive (low-budget) regime and match or exceed structure-aware baselines at higher budgets, offering an interpretable way to optimize VLM depth.

Key takeaway

For AI engineers and research scientists optimizing VLM deployment costs, this research indicates that domain-aware layer pruning can reduce model depth while preserving critical capabilities, especially for specialized tasks such as mathematical reasoning. Prioritize domain-aware ranking at low pruning budgets (e.g., 10%), where performance is highly sensitive to which layers are removed. At higher budgets, consider strategies that enforce structural continuity to limit degradation. The approach offers a practical path to more efficient, domain-specific VLM deployments.

Key insights

Domain-aware pruning effectively reduces VLM depth by identifying and removing layers least critical to specific tasks.

Principles

VLM decoder layers exhibit domain-specific activation patterns.
Pruning effectiveness varies across three distinct budget regimes.
Input-output activation similarity indicates layer redundancy.

Method

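The method has four steps, followed by a recovery stage. First, log each decoder layer's input and output activations on domain-specific prompts. Second, compute the cosine similarity between each layer's input and output as a redundancy score. Third, aggregate the scores per domain. Fourth, rank the layers for pruning. After pruning, a brief round of supervised fine-tuning (SFT) is applied.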
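The scoring and ranking steps can be illustrated with a minimal sketch. It assumes activations have already been logged as plain vectors; the record format, the function names, and the use of a simple mean for aggregation are hypothetical choices made for illustration, not details confirmed by the paper. It also assumes that a higher input/output similarity means the layer changes the representation less and is therefore pruned first.

```python
import math
from statistics import mean


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as "no similarity"
    return dot / (norm_u * norm_v)


def layer_redundancy(records):
    """Average input/output cosine similarity per layer.

    records: iterable of (layer_index, input_vector, output_vector),
    one entry per logged prompt (hypothetical logging format).
    """
    per_layer = {}
    for layer, x_in, x_out in records:
        per_layer.setdefault(layer, []).append(cosine_similarity(x_in, x_out))
    # Assumption: aggregate with a plain mean over the domain's prompts.
    return {layer: mean(sims) for layer, sims in per_layer.items()}


def rank_layers_for_pruning(scores):
    """Return layer indices, most redundant (highest similarity) first."""
    return sorted(scores, key=lambda layer: scores[layer], reverse=True)


# Toy "math domain" activations for a three-layer model.
math_records = [
    (0, [1.0, 0.0], [0.0, 1.0]),  # output orthogonal to input
    (1, [1.0, 1.0], [2.0, 2.0]),  # output is a rescaled copy of the input
    (2, [1.0, 0.0], [1.0, 1.0]),  # output partly rotated
]
math_scores = layer_redundancy(math_records)
math_ranking = rank_layers_for_pruning(math_scores)
print(math_ranking)
```

Running the same functions on a separate set of non-math records would give a second score table. How the paper combines the two into its mixed criterion is not specified in this summary, so the sketch stops at per-domain rankings.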

In practice

Use domain-aware ranking for low pruning budgets.
Prioritize non-math layers for general VLM stability.
Apply post-pruning SFT to restore model robustness.
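To make the budget-dependent advice concrete, here is a minimal sketch of turning a ranking into a set of layers to remove. The function name `select_layers` and the `min_gap` parameter are hypothetical. The gap rule is only one possible reading of a "spacing-aware" strategy and is not taken from the paper; with `min_gap=0` the selection follows the ranking alone.

```python
def select_layers(ranking, num_layers, budget, min_gap=0):
    """Pick layers to prune from a redundancy ranking.

    ranking:    layer indices, most prunable first.
    num_layers: total number of decoder layers.
    budget:     fraction of layers to remove (e.g. 0.10).
    min_gap:    hypothetical spacing rule; a candidate is skipped when it
                lies within min_gap positions of an already chosen layer.
    """
    target = round(num_layers * budget)
    chosen = []
    for layer in ranking:
        if len(chosen) == target:
            break
        if all(abs(layer - c) > min_gap for c in chosen):
            chosen.append(layer)
    # May return fewer than `target` layers if the gap rule is too strict.
    return sorted(chosen)


# Toy ranking for a 10-layer decoder (illustrative values only).
ranking = [4, 5, 3, 8, 1, 0, 2, 6, 7, 9]

ranking_only = select_layers(ranking, num_layers=10, budget=0.3)
spaced = select_layers(ranking, num_layers=10, budget=0.3, min_gap=1)
print(ranking_only, spaced)
```

In this toy case the ranking-only selection removes a contiguous block of three layers, while the spaced variant spreads the removals across the stack. Either selection would still be followed by the post-pruning SFT step listed above.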

Topics

Vision-Language Models
Model Pruning
Transformer Architectures
Mathematical Reasoning
Activation Similarity

Best for: AI Engineer, AI Scientist, Research Scientist, AI Researcher, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.